
The ungleich monitoring infrastructure » History » Version 23

Nico Schottelius, 08/13/2020 09:01 PM

h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6 only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
* There are *many monitored* systems
* Systems can be marked as intentionally down (but still kept monitored)

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>
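
For context, the probes referenced below come from blackbox_exporter. A minimal scrape job for such probes might look like this (a sketch only; the module name is illustrative, following the standard blackbox_exporter relabeling pattern):

<pre>
scrape_configs:
  - job_name: 'routers-place5'
    metrics_path: /probe
    params:
      module: [icmp]            # blackbox_exporter module to use
    static_configs:
      - targets:
          - 193.192.225.73      # router1
    relabel_configs:
      # blackbox_exporter expects the probed host as a URL parameter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # scrape the exporter itself, not the probed host
      - target_label: __address__
        replacement: 127.0.0.1:9115
</pre>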

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>
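
The same queries can also be issued against the Prometheus HTTP API, for instance with curl (assuming Prometheus listens on localhost:9090 as above):

<pre>
# GET /api/v1/query returns the result as JSON
curl -sG http://localhost:9090/api/v1/query \
     --data-urlencode 'query=probe_success{dc="place5"} == 1'
</pre>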

Typical queries:

Creating a sum of all metrics that contain a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus# 

</pre>

Combining different metrics for filtering. For instance, to filter all metrics of type "probe_success" which also have a metric @probe_ip_protocol@ with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The @on@ operator is used to join the two metrics on a common label:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum over all jobs matching a certain regex and matching on IP protocol is 0
** this particular job indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6
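
Put together, such a condition can be expressed as an alerting rule along these lines (a sketch only, not our production rule; the rule name, @for@ duration and severity are illustrative):

<pre>
groups:
  - name: connectivity
    rules:
      - alert: TotalConnectivityLossV4
        # 0 means: not a single IPv4 probe of this job succeeded
        expr: sum(probe_success{job=~"(routers|uplink)-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No IPv4 connectivity for {{ $labels.job }}"
</pre>

An analogous rule with @== 6@ covers IPv6.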

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]

</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * probe_success (2 routers are up)
* 12 = 6 (IP version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, it does not matter, as 0*4 = 0 and 0*6 = 0.

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs the URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~# 
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~# 
</pre>

Better to use @--author@ and co. TOBEFIXED
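
Silences can carry an explicit author and duration, and can be removed early; a sketch (the author value is illustrative, the ID is the one returned by @silence add@ above):

<pre>
# add a silence with an explicit author and duration
amtool silence add --author="nico" --duration="2h" \
    -c "Ceph is actually fine" alertname=CephHealthSate

# expire (remove) a silence by ID
amtool silence expire 4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
</pre>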

h3. Severity levels

The following notions are used:

* critical = panic = call the whole team
* warning = something needs to be fixed = email to sre, non-paging
* info = not good, might be an indication that something needs fixing, goes to a Matrix room
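
These severity levels can be mapped to receivers in the Alertmanager routing tree, roughly like this (a sketch; the receiver names are illustrative):

<pre>
route:
  receiver: matrix-room          # default: info and everything unmatched
  routes:
    - match:
        severity: critical
      receiver: page-whole-team
    - match:
        severity: warning
      receiver: sre-email
</pre>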

h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and thus see different labels!)
* Regular expressions are not the "default" RE syntax, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated time series like @up@ !
** You need to use @relabel_configs@ instead
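
To illustrate the difference: the first fragment below rewrites a target label before the scrape, the second drops a label from the scraped samples afterwards (both fragments are illustrative):

<pre>
relabel_configs:              # applied BEFORE the scrape
  - source_labels: [__address__]
    target_label: instance

metric_relabel_configs:       # applied AFTER the scrape
  - regex: 'exported_job'
    action: labeldrop
</pre>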

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set the "dc" label in case the host is in an ungleich place

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.
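
Alert expressions can then exclude intentionally down hosts by filtering on the role label, for instance (an illustrative expression, not a quoted rule):

<pre>
up{role!="down"} == 0
</pre>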
h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

Authorization is based on the email sender address.
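
In Alertmanager, such a gateway can be wired up as an ordinary email receiver; a sketch (the phone number and sender address are placeholders, and the sender must be authorized at ecall.ch):

<pre>
receivers:
  - name: sms-oncall
    email_configs:
      - to: '0041791234567@sms.ecall.ch'   # placeholder number
        from: 'monitoring@example.com'     # must be an authorized sender
</pre>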

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster
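
Alertmanager instances peer via the @--cluster.peer@ flag (default cluster port 9094); a sketch of how one inside monitor could be started (paths are illustrative):

<pre>
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.peer=emonitor1.place6.ungleich.ch:9094
</pre>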

h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.
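
A minimal monit check looks roughly like this (a sketch; the process name and init scripts are illustrative, the real configuration comes from the @__ungleich_monit@ cdist type):

<pre>
check process prometheus matching "prometheus"
  start program = "/etc/init.d/prometheus start"
  stop program  = "/etc/init.d/prometheus stop"
  if 5 restarts within 5 cycles then timeout
</pre>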

h3. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There is a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.
h2. Old Monitoring

Before 2020-07 our monitoring incorporated more services and had a different approach:

We used the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana were located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of Ceph and have a dedicated uplink.
h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents were running on all other monitored machines (such as servers and VMs).

consul was configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip) [https://www.consul.io/docs/guides/datacenters.html].

Consul has some drawbacks (nodes leaving the cluster are by default no longer monitored), and the advantage of fully dynamic monitoring is not a big one for physical machines of which we already have an inventory.

h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role; all other users were Viewers.

This was retired and the monitoring servers now have static usernames to be independent of the LDAP infrastructure.