h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6 only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
** Also monitors all nodes so that all data is available
* There are *many monitored* systems
* Systems can be marked as intentionally down (but still kept monitored)
* Monitoring systems are built with the fewest possible external dependencies
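
The external monitor checks reachability of the places with blackbox_exporter probes. A minimal sketch of what such a scrape job could look like; the job name, module, target list and exporter address are illustrative, not our actual configuration:

<pre>
# Hypothetical blackbox_exporter probe job on the external monitor
scrape_configs:
  - job_name: 'uplink-place5'            # assumed job name
    metrics_path: /probe
    params:
      module: [icmp]                     # blackbox module, must exist in blackbox.yml
    static_configs:
      - targets:
          - 2001:1700:3500::2            # router1.place5 (example target)
          - 2001:1700:3500::12           # router2.place5 (example target)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115      # blackbox_exporter address (assumed)
</pre>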

h3. Adding a new production system

* Install the correct exporter (often: node_exporter)
* Limit access via nftables (see the sketch below)
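
For the nftables part, a minimal sketch of the idea; the set name and monitoring addresses are placeholders, not our actual ruleset:

<pre>
# Hypothetical nftables rules: only the monitoring hosts may reach the exporter port
table inet filter {
    set monitoring6 {
        type ipv6_addr
        elements = { 2a0a:e5c0::10, 2a0a:e5c0::11 }   # example monitor addresses
    }

    chain input {
        type filter hook input priority 0; policy accept;
        tcp dport 9100 ip6 saddr @monitoring6 accept   # node_exporter from the monitors
        tcp dport 9100 drop                            # everyone else is blocked
    }
}
</pre>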

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found
</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that share a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)
'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus# 
</pre>

Combining different metrics for filtering: for instance, selecting all @probe_success@ metrics whose corresponding @probe_ip_protocol@ metric has the value 4.

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The @on@ operator is used to join the two metrics on the @instance@ label:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* If the sum over all jobs matching a certain regex, joined with the IP protocol, is 0
** a sum of 0 for such a job indicates total loss of connectivity
* We want to get a vector like this (see the example alert rule further below):
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * 2 (2 routers are up and answering probes)
* 12 = 6 (IP version 6) * 2 (2 routers are up and answering probes)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether zero probes are up, the scaling does not matter, since 0*4 = 0 and 0*6 = 0.
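
Putting this together, an alerting rule for "total loss of IPv4 connectivity on the uplinks of a place" could look roughly like the following. This is a sketch based on the queries above, not our actual rule file; the alert name, @for@ duration and labels are assumptions:

<pre>
groups:
  - name: uplink-example
    rules:
      - alert: UplinkIPv4Down            # hypothetical alert name
        expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
        for: 5m                          # assumed grace period
        labels:
          severity: critical             # see severity levels below
        annotations:
          summary: "No IPv4 uplink probes succeed for {{ $labels.job }}"
</pre>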

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~# 
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~# 
</pre>

Better to also set the author and duration when adding silences. TOBEFIXED
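
A possible invocation with an explicit author and expiry, assuming the amtool version in use supports the @--author@ and @--duration@ flags (untested sketch):

<pre>
amtool silence add --author="nico" --duration="2h" \
    -c "Ceph is actually fine" alertname=CephHealthSate
</pre>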

h3. Severity levels

The following severity levels are used:

* critical = panic = calls the whole team
* warning = something needs to be fixed = email to SRE, non-paging
* info = not good, might be an indication that something needs fixing, goes to a Matrix room
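
How these levels map to notification channels in alertmanager could look roughly like this. The receiver names, addresses and the Matrix webhook URL are placeholders, not our actual configuration:

<pre>
route:
  receiver: sre-email                  # default: non-paging email
  routes:
    - match:
        severity: critical
      receiver: team-page              # paging via SMS/voice (see ecall section below)
    - match:
        severity: info
      receiver: matrix-room            # hypothetical webhook into a Matrix room

receivers:
  - name: sre-email
    email_configs:
      - to: 'sre@example.org'          # placeholder address
  - name: team-page
    email_configs:
      - to: 'NUMBER@sms.ecall.ch'      # see "SMS and Voice notifications"
  - name: matrix-room
    webhook_configs:
      - url: 'https://matrix-bot.example.org/alert'   # placeholder
</pre>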

h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (the available labels are different at that point!)
* Regular expressions are not the "default" RE syntax, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated series like @up@!
** You need to use @relabel_configs@ for those
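
A small sketch to illustrate the difference; the job name and the dropped metric are chosen purely for illustration:

<pre>
scrape_configs:
  - job_name: example            # hypothetical job
    static_configs:
      - targets: ['server1.place5.ungleich.ch:9100']
    relabel_configs:             # before scraping: sees __address__ and __meta_* labels
      - source_labels: [__address__]
        regex: '(.*):9100'
        target_label: role
        replacement: 'server'    # this target label also ends up on the synthetic "up" metric
    metric_relabel_configs:      # after scraping: sees __name__ and the scraped labels
      - source_labels: [__name__]
        regex: 'node_arp_entries'
        action: drop             # drops a scraped metric; has no effect on "up"
</pre>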

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc., derived from the hostname)
* A special rule covers ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set the "dc" label in case the host is in an ungleich place

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

ecall uses authorization based on the email sender address.
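
Because ecall authorizes by sender, the alertmanager email settings have to use an authorized sender address. A rough sketch; the sender address, smarthost, receiver name and number are placeholders:

<pre>
global:
  smtp_from: 'alerts@ungleich.ch'        # must be a sender authorized at ecall (assumed address)
  smtp_smarthost: 'smtp.example.org:587' # placeholder relay

receivers:
  - name: sms-oncall                     # hypothetical receiver name
    email_configs:
      - to: '0041790000000@sms.ecall.ch' # placeholder number
</pre>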

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster
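
Clustering in alertmanager is configured via the @--cluster.*@ flags; a sketch of how the inside monitors could peer with each other (listen address and port are assumptions):

<pre>
# on emonitor1.place5 (hypothetical invocation)
alertmanager --config.file=/etc/alertmanager/alertmanager.yml \
    --cluster.listen-address=[::]:9094 \
    --cluster.peer=emonitor1.place6.ungleich.ch:9094

# on emonitor1.place6, peer back to place5 accordingly
</pre>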

h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. It is very similar to supervise and co.

h3. Service/Customer monitoring

* A few blackbox checks can be found on the datacenter monitoring infrastructure.
* There is a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. It has no alertmanager yet and is partially manual.

h2. Old Monitoring

Before 2020-07 our monitoring covered more services and had a different approach.

We used the following technologies / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

These monitoring machines are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The consul servers are still located on the physical machines (red{1..3} and black{1..3}, respectively) and the agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html

Consul has some drawbacks (a node that leaves the cluster is, by default, no longer monitored), and fully dynamic service discovery is not a big advantage for physical machines of which we already have an inventory.

h3. Authentication

Grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired; the monitoring servers now have static usernames in order to be independent of the LDAP infrastructure.