h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6 only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
** Also monitors all nodes so that all data is available
* There is *1 customer enabled* monitoring system
** monitoring-v3.ungleich.ch
** Uses LDAP
** Runs on a VM
* There are *many monitored* systems
* Systems can be marked as intentionally down (but still kept monitored)
* Monitoring systems are built with as few external dependencies as possible

h3. Monitoring and Alerting workflow

* Once per day the SRE team checks the relevant dashboards
** Are systems down that should not be?
** Is there a visible trend of systems failing?
* If the monitoring system sends a notification about a failed system
** The SRE team fixes it the same day if possible
* If the monitoring system sends a critical error message
** Instant fixes are to be applied by the SRE team

h3. Adding a new production system

* Install the correct exporter (often: node_exporter)
* Limit access via nftables
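
A sketch of such an nftables rule, assuming node_exporter listens on port 9100; the allowed source below is the monitoring-v3 address mentioned later in this document, and the internal emonitors would have to be allowed in the same way:

<pre>
# Sketch only: allow only the monitoring hosts to reach node_exporter on :9100
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;

        # monitoring-v3 source address (add the emonitor addresses as needed)
        tcp dport 9100 ip6 saddr 2a0a:e5c0:2:2:0:c8ff:fe68:bf3b accept
        # everything else may not talk to the exporter
        tcp dport 9100 drop
    }
}
</pre>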

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Configuring emonitors

The emonitor hosts are configured with cdist:

<pre>
cdist config -bj7 -p3 -vv emonitor1.place{5,6,7}.ungleich.ch
</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that share a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)
'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#
</pre>

Combining different metrics for filtering, for instance to filter all metrics of type "probe_success" that also have a metric probe_ip_protocol with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used for this kind of filtering:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* If the sum over all jobs matching a certain regex, combined with the IP protocol, is 0
** then this particular job has a total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * probe_success (2 routers are up)
* 12 = 6 (IP version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, this does not matter, as 0*4 = 0 and 0*6 = 0.
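
Putting the pieces together, such an alert could look roughly as follows. This is a sketch only: the alert name, the @for@ duration and the severity label are assumptions, not the rules actually deployed; the expression is the IPv4 uplink query from above.

<pre>
groups:
  - name: uplink-monitoring-example
    rules:
      # Fires when no IPv4 probe of an uplink job succeeds anymore,
      # i.e. the sum of probe_success matched on the IP protocol is 0.
      - alert: UplinkIPv4Down
        expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "All IPv4 uplinks of {{ $labels.job }} are down"
</pre>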

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~#
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~#
</pre>

TOBEFIXED: better to also set the author and an explicit duration when adding silences.
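
A sketch of how that could look; the flag names should be checked against the installed amtool version, and the author is a placeholder:

<pre>
# Sketch: add a silence with an explicit author and duration instead of the defaults
amtool silence add \
  --author="yourname" \
  --duration="2h" \
  --comment="Ceph is actually fine" \
  alertname=CephHealthSate
</pre>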

h3. Severity levels

The following notions are used:

* critical = panic = call the whole team
* warning = something needs to be fixed = email to sre, non paging
* info = not good, might be an indication that something needs fixing, goes to a matrix room
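
A sketch of how these severities could be routed in alertmanager; the receiver names and routes are illustrative assumptions, not the deployed configuration:

<pre>
route:
  receiver: sre-email          # default: warning and anything unmatched, non paging
  routes:
    - match:
        severity: critical
      receiver: sre-phone      # paging, see the SMS and Voice notifications section
    - match:
        severity: info
      receiver: matrix-room    # non-paging notifications to a matrix room

receivers:
  - name: sre-phone
  - name: sre-email
  - name: matrix-room
</pre>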

h3. Labeling

Labeling in Prometheus is a science on its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see different labels!)
* Regular expressions are not the "default" RE, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated series like @up@!
** You need to use @relabel_configs@ for those
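
A small sketch illustrating the difference in a scrape config; the job name, target and the dropped metrics are only examples:

<pre>
scrape_configs:
  - job_name: example
    static_configs:
      - targets: ['server1.place5.ungleich.ch:9100']
    # relabel_configs run before the scrape and act on target labels,
    # so the resulting labels also end up on automatic series like "up".
    relabel_configs:
      - source_labels: [__address__]
        regex: '.*(place.*).ungleich.ch.*'
        target_label: 'dc'
        replacement: '$1'
    # metric_relabel_configs run after the scrape and act on the scraped
    # samples only; "up" is not affected by them.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
</pre>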

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* __address__ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set up the "dc" label in case the host is in one of the ungleich places

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

ecall uses email sender based authorization.
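
A sketch of an alertmanager receiver using ecall; the number, sender address and smarthost are placeholders, and the sender must be one that is authorized at ecall:

<pre>
receivers:
  - name: sre-phone
    email_configs:
      # Voice call to the given number (placeholder)
      - to: '0041790000000@voice.ecall.ch'
        from: 'monitoring@example.com'
        smarthost: 'smtp.example.com:587'
        send_resolved: false
      # SMS to the same number (placeholder)
      - to: '0041790000000@sms.ecall.ch'
        from: 'monitoring@example.com'
        smarthost: 'smtp.example.com:587'
        send_resolved: false
</pre>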
292
293
h3. Alertmanager clusters
294
295 21 Nico Schottelius
* The outside monitors form one alertmanager cluster
296
* The inside monitors form one alertmanager cluster
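
The alertmanagers of a cluster find each other via command line flags; a minimal sketch for two peers (host names and the default cluster port 9094 are only an illustration):

<pre>
# On emonitor1.place5 (sketch):
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=[::]:9094 \
  --cluster.peer=emonitor1.place6.ungleich.ch:9094

# On emonitor1.place6, point --cluster.peer at emonitor1.place5 instead.
</pre>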

h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.

h3. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

h2. Monitoring Rules

The following describes the logical rules that are (or need to be, or should be) in place.

h3. External Monitoring/Alerting

To be able to catch multiple uplink errors, there should be 2 external prometheus systems operating in a cluster for alerting (alertmanagers).
The retention period of these monitoring servers can be low, as their main purpose is link down detection. No internal services need to be monitored.

h3. External Uplink monitoring (IPv6, IPv4)

* There should be 2 external systems that monitor the two routers per place via ping
** Whether IPv4 and IPv6 are done by the same systems does not matter
** However, there need to be:
*** 2 for IPv4 (place4, place7)
*** 2 for IPv6 (place4, ?)
* If all uplinks of one place are down for at least 5m, we send out an emergency alert

h3. External DNS monitoring (IPv6, IPv4)

* There should be 2 external systems that monitor whether our authoritative DNS servers are working
* We query whether ipv4.ungleich.ch resolves to an IPv4 address
* We query whether ipv6.ungleich.ch resolves to an IPv6 address
* If all external servers fail for 5m, we send out an emergency alert
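
Such a check can be done with the blackbox_exporter DNS prober; a minimal sketch (the module names are made up, and the probed target would be the authoritative server under test):

<pre>
modules:
  dns_ipv4_ungleich:
    prober: dns
    timeout: 5s
    dns:
      query_name: "ipv4.ungleich.ch"
      query_type: "A"
      valid_rcodes:
        - NOERROR
  dns_ipv6_ungleich:
    prober: dns
    timeout: 5s
    dns:
      query_name: "ipv6.ungleich.ch"
      query_type: "AAAA"
      valid_rcodes:
        - NOERROR
</pre>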

h3. Internal ceph monitors

* Monitor whether there is a quorum
** If there is no quorum for at least 15m, we send out an emergency alert
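
A sketch of how such a rule could look in @ceph-alerts.rules@, assuming the ceph mgr prometheus module is scraped and exposes @ceph_mon_quorum_status@ (1 per monitor that is in quorum); the alert name and threshold are illustrative:

<pre>
- alert: CephMonQuorumLost
  # Assumption: fewer than 2 monitors in quorum means quorum is effectively lost
  expr: sum(ceph_mon_quorum_status) < 2
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Ceph monitors have had no quorum for 15 minutes"
</pre>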

h3. Monitoring monitoring

* The internal monitors monitor whether the external monitors are reachable

h2. Typical tasks

h3. Adding customer monitoring

Customers can have their own alerts. By default, if customer resources are monitored, we ...

* If we do not have access to the VM: ask the user to set up the prometheus node exporter and whitelist port 9100 so that it is accessible from 2a0a:e5c0:2:2:0:c8ff:fe68:bf3b (see the scrape config sketch below)
* Otherwise, do the above step ourselves
* Ensure the customer has an LDAP account
** Ask the user to log in with their LDAP user to https://monitoring-v3.ungleich.ch/ - this way grafana knows about the user (similar to redmine)
* Create a folder in grafana on https://monitoring-v3.ungleich.ch/ with the same name as the LDAP user (for instance "nicocustomer")
* Modify the permissions of the folder
** Remove the standard Viewer role
** Add User -> the LDAP user -> View
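
Once the node exporter is reachable, a scrape job along these lines could be added on monitoring-v3; the job name, target host and the customer label are made-up examples for a hypothetical customer "nicocustomer":

<pre>
scrape_configs:
  - job_name: customer-nicocustomer
    static_configs:
      - targets:
          - 'node.nicocustomer.example.com:9100'
        labels:
          customer: nicocustomer
</pre>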

Set up a dashboard. If it allows selecting nodes:

* Limit the variable by defining the regex in the dashboard settings

If the user requested alerts:

* Configure them in cdist, in the type @__dcl_monitoring_server2020/files/prometheus-v3/@

Finally:

<pre>
cdist config -v monitoring-v3.ungleich.ch
</pre>

h2. Old Monitoring

Before 2020-07 our monitoring incorporated more services and had a different approach:

We used the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent from ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents are running on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host is providing (e.g. the exporters).

There is inter-datacenter communication (WAN gossip), see https://www.consul.io/docs/guides/datacenters.html

Consul has some drawbacks (a node that leaves the cluster is by default no longer monitored), and fully dynamic monitoring is not a big advantage for physical machines of which we already have an inventory.

h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired; the monitoring servers now have static usernames in order to be independent of the LDAP infrastructure.