h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Introduction

We use the following technologies / products for monitoring:

* Consul (service discovery)
* Prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and Grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

h2. Consul

We use a Consul cluster per datacenter (e.g. place5 and place6).
The Consul servers run on the physical machines (red{1..3} and black{1..3}, respectively); the agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html
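
Prometheus can pick up these Consul-published services via its Consul service discovery. A minimal sketch of what such a scrape job could look like (the job and service names here are assumptions, not taken from our actual configuration):

<pre>
scrape_configs:
  - job_name: 'consul-node'            # example job name, not our real one
    consul_sd_configs:
      - server: 'localhost:8500'       # local Consul agent
        services: ['node-exporter']    # assumed service name published via Consul
    relabel_configs:
      - source_labels: [__meta_consul_node]
        target_label: instance         # use the Consul node name as the instance label
</pre>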

h2. Prometheus

Prometheus is responsible for pulling all data from the monitored hosts (via the exporters) and storing it, as well as for sending out alerts if needed (via the Alertmanager).

h3. Exporters

* Node (host-specific metrics, e.g. CPU, RAM and disk usage)
* Ceph (Ceph-specific metrics, e.g. pool usage, OSDs)
* Blackbox (metrics about the online state of HTTP/HTTPS services)

The node exporter is located on all monitored hosts.
The Ceph exporter is provided by Ceph itself and is located on the Ceph manager.
The blackbox exporter is located on the monitoring control machine itself.
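
For reference, a blackbox probe job typically follows the upstream pattern below; the target URL and module name are placeholders, not necessarily our production values:

<pre>
  - job_name: 'blackbox-https'
    metrics_path: /probe
    params:
      module: [http_2xx]               # module defined in the blackbox exporter config
    static_configs:
      - targets:
          - https://ungleich.ch        # example website to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # pass the target URL to the exporter
      - source_labels: [__param_target]
        target_label: instance         # keep the URL as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9115    # scrape the blackbox exporter itself
</pre>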

h3. Alerts

We have configured the following alerts:

* Ceph OSDs down
* Ceph health state is not OK
* Ceph quorum not OK
* Ceph pool disk usage too high
* Ceph disk usage too high
* Instance down
* Disk usage too high
* Monitored website down
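
The rules themselves live in the @*.rules@ files checked with @promtool@ further below. As an illustration only (thresholds and labels are assumptions, not the exact production rule), an "instance down" rule typically looks like this:

<pre>
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m                                    # assumed grace period
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
</pre>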

h2. Grafana

Grafana provides dashboards for the following:

* Node (metrics about CPU, RAM, disk usage and so on)
* Blackbox (metrics from the blackbox exporter)
* Ceph (important metrics from the Ceph exporter)

h3. Authentication

Grafana authentication works over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group are mapped to the Admin role; all other users are Viewers.
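
In Grafana this kind of mapping is done via @group_mappings@ in @ldap.toml@. A sketch only, with a hypothetical group DN (our actual LDAP tree may differ):

<pre>
# ldap.toml (sketch); the group DN below is hypothetical
[[servers.group_mappings]]
group_dn = "cn=devops,ou=groups,dc=ungleich,dc=ch"
org_role = "Admin"

[[servers.group_mappings]]
group_dn = "*"        # catch-all for everyone else
org_role = "Viewer"
</pre>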

h2. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist.

h2. Misc

* You're probably looking for the @__dcl_monitoring_server@ type, which centralizes a bunch of stuff.
* This page needs some love!

h2. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There's a new Prometheus+Grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No Alertmanager yet. Partially manual.

h2. Monitoring Guide

h3. Configuring Prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Querying Prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that contain a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)
'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus# 
</pre>

Combining different metrics for filtering, for instance to select all @probe_success@ series whose matching @probe_ip_protocol@ metric has the value 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to filter:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* If the sum over all targets of a certain job regex and IP protocol is 0
** this particular job then indicates a total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * probe_success (2 routers are up)
* 12 = 6 (IP version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, it does not matter, as 0*4 = 0 and 0*6 = 0.
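
Putting the pieces together, a sketch of a per-protocol "nothing is up" expression (not necessarily the exact production rule). The protocol filter uses the @bool@ modifier so that down targets still contribute a 0 to the sum instead of being filtered away, which keeps the result defined when everything is down:

<pre>
sum by (job) (probe_success{job=~"uplink-.*"} * on(instance) group_left(job) (probe_ip_protocol == bool 4)) == 0
</pre>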

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~# 
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~# 
</pre>

It is better to also set the author and related fields, as sketched below. TOBEFIXED
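
A sketch of a more complete silence; the author, duration and comment values are examples only (on Alpine, add @--alertmanager.url@ as above):

<pre>
amtool silence add \
  --author="nico" \
  --duration="2h" \
  --comment="Ceph maintenance" \
  alertname=CephHealthSate
</pre>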

h3. Severity levels

The following notions are used:

* critical = panic = calls the whole team
* warning = something needs to be fixed = email to SRE, non-paging
* info = not good, might be an indication that something needs fixing, goes to a Matrix room (see the routing sketch below)
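
A sketch of how such severities can be routed in the Alertmanager configuration; the receiver names are assumptions, not our actual config:

<pre>
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: page-team        # paging: SMS / voice
    - match:
        severity: warning
      receiver: sre-email        # non-paging email to SRE
    - match:
        severity: info
      receiver: matrix-room      # posted to a Matrix room
</pre>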

h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping and see a different set of labels (see the sketch after this list)
* Regular expressions are not the "default" RE syntax, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ do not apply to automatically generated metrics like @up@!
** You need to use @relabel_configs@ instead
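
An illustrative @metric_relabel_configs@ snippet, applied after the scrape (the regex is just an example, not taken from our config):

<pre>
    metric_relabel_configs:
      # runs AFTER scraping: drop all Go runtime metrics, as an example
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
</pre>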

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* A special rule covers ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set the "dc" label if the host is in an ungleich place

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.
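
One way the @role@ label can then be used is to keep intentionally-down targets out of queries and alert expressions, for example (a sketch, not necessarily our exact rule):

<pre>
up{role!="down"} == 0
</pre>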

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

ecall uses email-sender-based authorization.
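
On the Alertmanager side this maps to a plain email receiver; a sketch only (the receiver name is an assumption, global SMTP settings are assumed to be configured elsewhere, and the number stays a placeholder):

<pre>
receivers:
  - name: page-team
    email_configs:
      - to: 'number@sms.ecall.ch'      # SMS
      - to: 'number@voice.ecall.ch'    # voice call
</pre>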

h2. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster