h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6 only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
** Also monitors all nodes so that all data is available
* There is *1 customer-enabled* monitoring system
** monitoring-v3.ungleich.ch
** Uses LDAP
** Runs on a VM
* There are *many monitored* systems
* Systems can be marked as intentionally down (but still kept monitored)
* Monitoring systems are built with as few external dependencies as possible

h3. Monitoring and Alerting workflow

* Once per day the SRE team checks the relevant dashboards
** Are systems down that should not be?
** Is there a visible trend of systems failing?
* If the monitoring system sent a notification about a failed system
** The SRE team fixes it the same day if possible
* If the monitoring system sent a critical error message
** Instant fixes are to be applied by the SRE team

h3. Adding a new production system

* Install the correct exporter (often: node_exporter)
* Limit access via nftables (see the sketch below)
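
A minimal nftables sketch for this step, assuming node_exporter listens on its default port 9100; the table/chain names are illustrative and the source address is the monitoring host referenced in "Adding customer monitoring" below:

<pre>
table inet monitoring {
    chain input {
        type filter hook input priority 0; policy accept;

        # allow the monitoring host to scrape node_exporter
        ip6 saddr 2a0a:e5c0:2:2:0:c8ff:fe68:bf3b tcp dport 9100 accept

        # block the exporter port for everyone else
        tcp dport 9100 drop
    }
}
</pre>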

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that contain a common label. For instance, summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)
'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus# 

</pre>

Combining different metrics for filtering. For instance, to filter all metrics of type "probe_success" which also have a metric probe_ip_protocol with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to filter:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum of all jobs matching a certain regex and IP protocol is 0
** this particular job indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * probe_success (2 routers are up)
* 12 = 6 (IP version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, normalisation does not matter, as 0*4 = 0 and 0*6 = 0.
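
A minimal alerting rule sketch built on the expression above; the rule name, @for@ duration and annotation text are illustrative, not the rules actually deployed:

<pre>
groups:
  - name: uplink-monitoring
    rules:
      - alert: UplinkIPv4Down
        # 0 means that no IPv4 uplink probe of this job succeeds at all
        expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "All IPv4 uplinks of {{ $labels.job }} are down"
</pre>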

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary                                                               
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down                                 
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down                                 
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down                          
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down  
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down                           
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.                                          
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032                       
[14:54:41] monitoring.place6:~# 
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment                
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine  
[15:00:13] monitoring.place6:~# 
</pre>

Better to also set the author and related options. TOBEFIXED
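
A sketch of what that could look like; the author, duration and comment values are illustrative:

<pre>
amtool silence add \
  --author="nico" \
  --duration="2h" \
  --comment="Ceph is actually fine" \
  alertname=CephHealthSate
</pre>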

h3. Severity levels

The following notions are used:

* critical = panic = call the whole team
* warning = something needs to be fixed = email to SRE, non-paging
* info = not good, might be an indication that something needs fixing, goes to a Matrix room
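
A sketch of how this mapping could look as an alertmanager routing tree; the receiver names and matchers are illustrative, not our deployed configuration:

<pre>
route:
  receiver: sre-email             # default: warning-style mails to SRE, non-paging
  routes:
    - match:
        severity: critical
      receiver: sre-page          # paging, see "SMS and Voice notifications" below
    - match:
        severity: info
      receiver: matrix-room       # e.g. a webhook posting into a Matrix room
</pre>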

h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see different labels!)
* Regular expressions are not the "default" RE, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated timeseries like @up@!
** You need to use @relabel_configs@
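
A minimal sketch illustrating the difference, with a hypothetical target and filesystem filter:

<pre>
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['server1.place5.ungleich.ch:9100']
    relabel_configs:            # applied BEFORE the scrape: operates on target labels such as __address__
      - source_labels: [__address__]
        regex: '(.*):9100'
        target_label: instance
        replacement: '$1'
    metric_relabel_configs:     # applied AFTER the scrape: operates on the labels of the scraped samples
      - source_labels: [mountpoint]
        regex: '/var/lib/docker/.*'
        action: drop
</pre>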

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, for instance server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set the "dc" label in case the host is in one of the ungleich places

h3. Marking hosts down

If a host or service is intentionally down, *change its role* to *down*.

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

Uses email sender-based authorization.
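
A sketch of how a paging receiver could be wired up via ecall in alertmanager; the number, sender address and smarthost are placeholders, and the sender must be authorized at ecall:

<pre>
receivers:
  - name: sre-page
    email_configs:
      - to: '0041791234567@voice.ecall.ch'   # placeholder number
        from: 'alerts@ungleich.ch'           # must be an authorized sender
        smarthost: 'smtp.example.ch:587'     # placeholder mail relay
</pre>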

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster
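
A sketch of how two alertmanagers are typically joined into one cluster; the peer addresses are illustrative:

<pre>
# on emonitor1.place5
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address="[::]:9094" \
  --cluster.peer=emonitor1.place6.ungleich.ch:9094
</pre>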

h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.

h3. Service/Customer monitoring

* A few blackbox_exporter checks can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

h2. Monitoring Rules

The following is a description of logical rules that (are, need to be, should be) in place.

h3. External Monitoring/Alerting

To be able to catch multiple uplink errors, there should be 2 external prometheus systems operating in a cluster for alerting (alertmanagers).
The retention period of these monitoring servers can be low, as their main purpose is link-down detection. No internal services need to be monitored.
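
Low retention is just a prometheus start flag; the value below is illustrative:

<pre>
prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=15d
</pre>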

h3. External Uplink monitoring (IPv6, IPv4)

* There should be 2 external systems that monitor the two routers per place via ping (a blackbox_exporter sketch follows below)
** Whether IPv4 and IPv6 are handled by the same systems does not matter
** However there need to be
*** 2 for IPv4 (place4, place7)
*** 2 for IPv6 (place4, ?)
* If all uplinks of one place are down, we send out an emergency alert
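
A sketch of such a ping probe with blackbox_exporter; the module name is illustrative and the targets are the place5 router addresses from the query examples above:

<pre>
# blackbox.yml
modules:
  icmp6:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip6

# prometheus.yml
scrape_configs:
  - job_name: routers-place5
    metrics_path: /probe
    params:
      module: [icmp6]
    static_configs:
      - targets: ['2001:1700:3500::2', '2001:1700:3500::12']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # where blackbox_exporter runs
</pre>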

h3. External DNS monitoring (IPv6, IPv4)

* There should be 2 external systems that monitor whether our authoritative DNS servers are working (a blackbox_exporter sketch follows below)
* We query whether ipv4.ungleich.ch resolves to an IPv4 address
* We query whether ipv6.ungleich.ch resolves to an IPv6 address
* If all external servers fail, we send out an emergency alert
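
A sketch of the corresponding blackbox_exporter DNS modules; the module names are illustrative, and the scrape target for such a job would be the authoritative DNS server being tested:

<pre>
modules:
  dns_ipv4_ungleich:
    prober: dns
    timeout: 5s
    dns:
      query_name: "ipv4.ungleich.ch"
      query_type: "A"
  dns_ipv6_ungleich:
    prober: dns
    timeout: 5s
    dns:
      query_name: "ipv6.ungleich.ch"
      query_type: "AAAA"
</pre>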

h2. Typical tasks

h3. Adding customer monitoring

Customers can have their own alerts. By default, if customer resources are monitored, we ...

* If we do not have access to the VM: ask the user to set up the prometheus node exporter and whitelist port 9100 to be accessible from 2a0a:e5c0:2:2:0:c8ff:fe68:bf3b
* Otherwise do the above step ourselves
* Ensure the customer has an LDAP account
** Ask the user to log in with their LDAP user to https://monitoring-v3.ungleich.ch/ - this way grafana knows about the user (similar to redmine)
* Create a folder in grafana on https://monitoring-v3.ungleich.ch/ with the same name as the LDAP user (for instance "nicocustomer")
* Modify the permissions of the folder
** Remove the standard Viewer role
** Add User -> the LDAP user -> View

Set up a dashboard. If it allows selecting nodes:

* Limit the variable by defining the regex in the dashboard settings

If the user requested alerts:

* Configure them in cdist, type @__dcl_monitoring_server2020/files/prometheus-v3/@

Finally:

<pre>
cdist config -v monitoring-v3.ungleich.ch
</pre>

h2. Old Monitoring

Before 2020-07 our monitoring incorporated more services / had a different approach:

We used the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents are running on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host is providing (e.g. the exporters).

There is inter-datacenter communication (WAN gossip) [https://www.consul.io/docs/guides/datacenters.html].

Consul has some drawbacks (nodes leaving the cluster are, by default, no longer monitored), and the advantage of fully dynamic monitoring is not a big one for physical machines of which we already have an inventory.

h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired; monitoring servers now have static usernames to be independent of the LDAP infrastructure.