h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6-only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual-stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
* There are *many monitored* systems
* Systems can be marked as intentionally down (but are still monitored)
* Monitoring systems are built with the least amount of external dependencies
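
To make this concrete, here is a minimal sketch of what a blackbox probe job on an internal monitor could look like. The job name, module and target address are placeholders (not the actual configuration); it assumes blackbox_exporter runs on the monitor itself on port 9115:

<pre>
# Minimal sketch of a blackbox probe job (names and addresses are placeholders)
scrape_configs:
  - job_name: 'routers-place5'
    metrics_path: /probe
    params:
      module: [icmp]                  # blackbox_exporter module to use
    static_configs:
      - targets:
        - '2001:db8::1'               # placeholder router address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target  # probe this target ...
      - source_labels: [__param_target]
        target_label: instance        # ... keep it as the instance label ...
      - target_label: __address__
        replacement: 127.0.0.1:9115   # ... via the local blackbox_exporter
</pre>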

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that contain a common label. For instance, summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum by (job) (probe_success)'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus# 
</pre>

Combining different metrics for filtering. For instance, to filter all @probe_success@ metrics whose corresponding @probe_ip_protocol@ metric has the value 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used for this kind of filtering:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum over all jobs matching a certain regex and IP protocol is 0
** such a sum of 0 indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * probe_success (2 routers are up)
* 12 = 6 (IP version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 probes are up, the normalisation does not matter, as 0*4 = 0 and 0*6 = 0.
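
Such a query can be turned into an alerting rule. The following is only a sketch of what a rule along these lines could look like; the group name, alert name, @for@ duration and annotations are made up and not the actual contents of @uplink-monitoring.rules@. It uses @and on(instance)@ instead of the multiplication above, so that a total outage still yields a sum of 0 for the job instead of an empty result:

<pre>
groups:
  - name: uplink-monitoring                # hypothetical group name
    rules:
      - alert: UplinkIPv4Lost              # hypothetical alert name
        # "and on(instance)" keeps the probe_success value (0 or 1) of every
        # IPv4 probe, so the per-job sum really becomes 0 when all probes are
        # down (the filtering comparison above would drop those series instead)
        expr: sum by (job) (probe_success{job=~"uplink-.*"} and on(instance) (probe_ip_protocol == 4)) == 0
        for: 5m                            # made-up duration
        labels:
          severity: critical               # see severity levels below
        annotations:
          summary: "IPv4 uplink connectivity lost"
</pre>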

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# On Alpine the URL needs to be passed explicitly (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary                                                               
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down                                 
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down                                 
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down                          
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down  
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down                           
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.                                          
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032                       
[14:54:41] monitoring.place6:~# 
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment                
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine  
[15:00:13] monitoring.place6:~# 
</pre>

It would be better to also set the author and related options explicitly when adding silences; see the sketch below. TOBEFIXED
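
A possible invocation using @amtool@'s @--author@, @--duration@ and @--comment@ flags; the values below are only examples:

<pre>
# set the author, a comment and an explicit duration when silencing (example values)
amtool silence add \
  --author="nico" \
  --duration="2h" \
  --comment="Ceph is actually fine" \
  alertname=CephHealthSate
</pre>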

h3. Severity levels

The following notions are used:

* critical = panic = the whole team is called
* warning = something needs to be fixed = email to SRE, non-paging
* info = not good, might be an indication that something needs fixing, goes to a Matrix room
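
These severities can be mapped to Alertmanager routes on a @severity@ label. The following is only a sketch with made-up receiver names; it is not the actual routing configuration:

<pre>
route:
  receiver: default                 # fallback receiver (placeholder)
  routes:
    - match:
        severity: critical
      receiver: voice-and-sms       # pages the whole team (placeholder name)
    - match:
        severity: warning
      receiver: sre-email           # non-paging email (placeholder name)
    - match:
        severity: info
      receiver: matrix-room         # placeholder name
</pre>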

h3. Labeling

Labeling in Prometheus is a science on its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see different labels!)
* Regular expressions are not the "default" RE, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ do not apply to automatically generated metrics like @up@!
** You need to use @relabel_configs@ for those

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set up the "dc" label in case the host is in an ungleich place

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.
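
Alert expressions can then exclude these hosts by filtering on that label. A sketch, not necessarily the exact expression used in the rule files:

<pre>
# only alert on targets whose role is not "down"
up{role!="down"} == 0
</pre>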

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

Authorization is based on the email sender address.
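
On the Alertmanager side this translates into an ordinary email receiver pointing at the ecall gateway addresses. A sketch with a placeholder number, sender and mail relay (the sender must be an address authorized at ecall.ch):

<pre>
receivers:
  - name: voice-and-sms                    # placeholder receiver name
    email_configs:
      - to: '0041791234567@sms.ecall.ch'   # placeholder number
        from: 'alertmanager@ungleich.ch'   # must be an authorized sender
        smarthost: 'smtp.example.com:587'  # placeholder mail relay
</pre>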

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster
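
A sketch of how the two inside monitors could be peered using the standard alertmanager cluster flags (hostnames from above, default cluster port 9094; the actual invocation may differ):

<pre>
# on each inside monitor (sketch only)
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address="[::]:9094" \
  --cluster.peer=emonitor1.place5.ungleich.ch:9094 \
  --cluster.peer=emonitor1.place6.ungleich.ch:9094
</pre>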

h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.

h3. Service/Customer monitoring

* A few blackbox checks can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

h2. Old Monitoring

Before 2020-07, our monitoring covered more services and followed a different approach.

We used the following technologies / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents are running on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip) [https://www.consul.io/docs/guides/datacenters.html].

Consul has some drawbacks (a node that leaves the cluster is by default no longer monitored), and fully dynamic monitoring is not a big advantage for physical machines of which we already have an inventory.

h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired; the monitoring servers now use static usernames to be independent of the LDAP infrastructure.