h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Prometheus

Prometheus is responsible for collecting all metrics from the monitored hosts (via exporters) and for storing them. It also sends out alerts if needed (via the alertmanager).

h3. Exporters

* Node (host-specific metrics, e.g. CPU, RAM and disk usage)
* Ceph (Ceph-specific metrics, e.g. pool usage, OSDs)
* Blackbox (metrics about the online state of http/https services)

The node exporter is located on all monitored hosts.
The Ceph exporter is provided by Ceph itself and is located on the Ceph manager.
The blackbox exporter is located on the monitoring control machine itself.
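
For orientation, a minimal @scrape_configs@ sketch showing how such exporters are typically scraped. The job names, hostnames and the blackbox module below are illustrative assumptions, not copied from our live configuration:

<pre>
scrape_configs:
  # node exporter on every monitored host (default port 9100)
  - job_name: 'node'
    static_configs:
      - targets: ['server1.place5.ungleich.ch:9100']

  # ceph-mgr prometheus module (default port 9283)
  - job_name: 'ceph'
    static_configs:
      - targets: ['ceph-manager.place5.ungleich.ch:9283']

  # blackbox exporter: probe http/https endpoints via the exporter on the monitoring host
  - job_name: 'blackbox-https'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://ungleich.ch']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'localhost:9115'   # address of the blackbox exporter itself
</pre>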

h2. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist.

h2. Misc

* You're probably looking for the @__dcl_monitoring_server@ type, which centralizes a bunch of stuff.
* This page needs some love!

h2. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

h2. Monitoring Guide

h3. Configuring Prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Querying Prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum over all metrics that share a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#

</pre>

Combining different metrics for filtering: for instance, selecting all @probe_success@ metrics whose corresponding @probe_ip_protocol@ metric has the value 4.

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The @on@ operator is used to match the two metrics on a common label:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* If the sum over all jobs matching a certain regex, combined with a match on the IP protocol, is 0
** this particular job indicates a total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]

</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * probe_success (2 routers are up)
* 12 = 6 (IP version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, the normalisation does not matter, since 0*4 = 0 and 0*6 = 0.
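
Putting the pieces together, an alerting rule based on this query could look roughly like the following sketch. The rule name, @for@ duration and severity label are illustrative assumptions, not our production rules:

<pre>
groups:
  - name: uplink-connectivity
    rules:
      - alert: UplinkIPv4Down
        # 0 means: no IPv4 uplink probe in this job succeeds = total loss of connectivity
        expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "All IPv4 uplink probes for {{ $labels.job }} are failing"
</pre>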

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~#
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~#
</pre>

Better to also set the author and related metadata when adding silences. TOBEFIXED
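
For reference, a possible invocation that also sets the author and an expiry; the values below are placeholders, not a documented standard yet:

<pre>
amtool silence add --author="nico" --duration="2h" \
  --comment="Ceph is actually fine" alertname=CephHealthSate
</pre>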

h3. Severity levels

The following notions are used:

* critical = panic = calling the whole team
* warning = something needs to be fixed = email to SRE, non-paging
* info = not good, might be an indication that something needs fixing, goes to a Matrix room

h3. Labeling

Labeling in Prometheus is a science on its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see different labels!)
* Regular expressions are not the "default" RE syntax, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated metrics like @up@!
** You need to use @relabel_configs@ for those
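
A small sketch of the difference; the job, target and dropped metric below are made up for illustration:

<pre>
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['server1.place5.ungleich.ch:9100']
    # applied BEFORE the scrape: rewrites target labels such as instance/role
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):\d+'
        target_label: instance
        replacement: '$1'
    # applied AFTER the scrape: operates on the scraped samples themselves
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_scrape_collector_duration_seconds'
        action: drop
</pre>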

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, for instance server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set up the "dc" label in case the host is in one of the ungleich places

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.
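
A sketch of how the role can then be used to keep intentionally down hosts out of alerting; the expression below is a guess at how the label would be used, not a copy of the production rules:

<pre>
groups:
  - name: example-alerts
    rules:
      - alert: InstanceDown
        # hosts whose role has been set to "down" are excluded on purpose
        expr: up{role!="down"} == 0
        for: 5m
        labels:
          severity: warning
</pre>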

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

ecall uses email-sender-based authorization.
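
A hedged sketch of how this can be wired into the alertmanager; the receiver names, addresses, SMTP host and the number placeholder are assumptions, not our live configuration:

<pre>
route:
  receiver: sre-mail
  routes:
    # page via ecall only for critical alerts (see severity levels above)
    - match:
        severity: critical
      receiver: ecall-sms

receivers:
  - name: sre-mail
    email_configs:
      - to: 'sre@example.ungleich.ch'
  - name: ecall-sms
    email_configs:
      - to: 'number@sms.ecall.ch'
        from: 'monitoring@example.ungleich.ch'   # must be an authorized sender
        smarthost: 'localhost:25'
</pre>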

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster
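
Clustering is configured via the alertmanager's @--cluster.*@ flags; a sketch with placeholder hostnames, not our actual peers:

<pre>
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=monitor1.example.ungleich.ch:9094 \
  --cluster.peer=monitor2.example.ungleich.ch:9094
</pre>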

h2. Old Monitoring

Before 2020-07, our monitoring incorporated more services and followed a different approach:

We used the following technologies / products for monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and Grafana were located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of Ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The servers are still located on the physical machines (red{1..3} and black{1..3}, respectively) and the agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html

Consul has some drawbacks (a node that leaves the cluster is, by default, no longer monitored), and fully dynamic monitoring is not a big advantage for physical machines of which we already have an inventory.

h3. Authentication

The Grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role; all other users were Viewers.

This was retired; the monitoring servers now have static usernames in order to be independent of the LDAP infrastructure.