h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6 only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
** Also monitors all nodes so that all data is available
* There are *many monitored* systems
* Systems can be marked as intentionally down (but still kept monitored)
* Monitoring systems are built with the least amount of external dependencies

h3. Monitoring and Alerting workflow

* Once per day the SRE team checks the relevant dashboards
** Are systems down that should not be?
** Is there a visible trend of systems failing?
* If the monitoring system sent a notification about a failed system
** The SRE team fixes it the same day if possible
* If the monitoring system sent a critical error message
** Instant fixes are to be applied by the SRE team

h3. Adding a new production system

* Install the correct exporter (often: node_exporter)
* Limit access via nftables (see the sketch below)
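
A minimal nftables sketch: it assumes an existing @inet filter@ table with an @input@ chain, and the allowed source addresses are placeholders, not our real monitor IPs. Only the monitoring hosts may reach the node_exporter port:

<pre>
# Sketch only (placeholder addresses): allow the monitors to scrape
# node_exporter on port 9100 and drop other access to that port.
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;
        tcp dport 9100 ip6 saddr { 2001:db8::10, 2001:db8::11 } accept
        tcp dport 9100 drop
    }
}
</pre>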

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that share a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success
'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus# 

</pre>


Combining different metrics for filtering. For instance, to filter all metrics of type "probe_success" which also have a metric probe_ip_protocol with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to filter:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum of all jobs of a certain regex and match on ip protocol is 0
** this particular job indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus# 
</pre>


Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]

</pre>

The values 8 and 12 mean:

* 8 = 4 (ip version 4) * probe_success (2 routers are up)
* 12 = 6 (ip version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, it does not matter, as 0*4 = 0 and 0*6 = 0.
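
Putting this together, a hedged sketch of such an alerting rule (group name, alert name, @for@ duration and severity are illustrative, not our production rules). Note the parentheses around @probe_ip_protocol == 4@: they keep targets that are down in the result vector with value 0, so the sum can actually reach 0 instead of becoming empty:

<pre>
groups:
  - name: uplink-monitoring
    rules:
      - alert: UplinkIPv4Down
        # 0 = no IPv4 probe of this job succeeds = total loss of IPv4 connectivity
        expr: sum by (job) (probe_success{job=~"uplink-.*"} * on(instance) group_left(job) (probe_ip_protocol == 4)) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "IPv4 connectivity lost for {{ $labels.job }}"
</pre>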

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>


<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary                                                               
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down                                 
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down                                 
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down                          
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down  
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down                           
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.                                          
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032                       
[14:54:41] monitoring.place6:~# 
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment                
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine  
[15:00:13] monitoring.place6:~# 
</pre>

It would be better to also set the author and related fields when adding silences. TOBEFIXED
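
A sketch of what that could look like (author, comment and duration are illustrative values):

<pre>
amtool silence add --author="nico" --comment="Ceph is actually fine" --duration="2h" alertname=CephHealthSate
</pre>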

h3. Severity levels

The following notions are used (an example routing sketch follows the list):

* critical = panic = calling the whole team
* warning = something needs to be fixed = email to sre, non-paging
* info = not good, might be an indication that something needs fixing, goes to a matrix room
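
A minimal sketch of how these levels could map to alertmanager routes; the receiver names are placeholders, not our actual configuration:

<pre>
route:
  receiver: sre-email          # default: warning -> non-paging mail to the SRE team
  routes:
    - match:
        severity: critical
      receiver: sre-page       # paging: SMS/voice to the whole team
    - match:
        severity: info
      receiver: matrix-room    # informational: forwarded to a matrix room
</pre>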

h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some (a short sketch follows the list):

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see different labels!)
* Regular expressions are not the "default" RE, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated series like @up@!
** You need to use @relabel_configs@
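
A short sketch contrasting the two; the job name, port and metric regex are purely illustrative:

<pre>
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['server1.place5.ungleich.ch:9100']
    relabel_configs:              # runs BEFORE the scrape: sees __address__ and other target labels
      - source_labels: [__address__]
        regex:         '(.*):9100'
        target_label:  'instance'
        replacement:   '$1'
    metric_relabel_configs:       # runs AFTER the scrape: sees the scraped metric names and labels
      - source_labels: [__name__]
        regex:         'node_scrape_collector_.*'
        action:        drop
</pre>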

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here (an example query follows the list):

* @__address__@ contains the hostname+port, f.i. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set the "dc" label in case the host is in one of the ungleich places
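
Since @role@ and @dc@ are target labels, they also appear on the automatically generated @up@ series, so a quick per-role overview can be queried like this (illustrative, output will differ):

<pre>
promtool query instant http://localhost:9090 'count by (role) (up == 1)'
</pre>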

h3. Marking hosts down

If a host or service is intentionally down, *change its role* to *down*.
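
Following the pattern of the relabel_configs above, a sketch for a single host (the hostname is hypothetical):

<pre>
      - source_labels: [__address__]
        regex:         '(server47.place5.ungleich.ch.*)'
        target_label:  'role'
        replacement:   'down'
</pre>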

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

ecall uses email-sender-based authorization.
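
In alertmanager this can be wired up as an email receiver; a sketch with a placeholder number and receiver name (the sending address must be one that is authorized at ecall, and the global SMTP settings are omitted here):

<pre>
receivers:
  - name: sre-page
    email_configs:
      - to: '0041791234567@sms.ecall.ch'      # placeholder number
      - to: '0041791234567@voice.ecall.ch'    # placeholder number
</pre>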

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster
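
Clustering is configured via the @--cluster.*@ flags; a sketch for the internal cluster (the listen address and port 9094 are the alertmanager defaults, not verified against our setup):

<pre>
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address="[::]:9094" \
  --cluster.peer=emonitor1.place5.ungleich.ch:9094 \
  --cluster.peer=emonitor1.place6.ungleich.ch:9094
</pre>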

h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.

h3. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There is a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. There is no alertmanager yet, and it is partially manual.

h2. Old Monitoring

Before 2020-07 our monitoring incorporated more services and had a different approach:

We used the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana were located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents are running on all other monitored machines (such as servers and VMs).

consul is configured to publish the services its host is providing (e.g. the exporters).

There is inter-datacenter communication (WAN gossip) [https://www.consul.io/docs/guides/datacenters.html].

Consul has some drawbacks: nodes that leave the cluster are by default no longer monitored, and fully dynamic monitoring is not a big advantage for physical machines of which we already have an inventory.

h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired and the monitoring servers now have static usernames, to be independent of the LDAP infrastructure.