h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that contain a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)
'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#

</pre>

Combining different metrics for filtering, for instance to filter all metrics of type "probe_success" which also have a metric @probe_ip_protocol@ with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The @on@ operator is used to match the two metrics on a common label:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert (a rule sketch follows after the normalisation discussion below):

* if the sum over all jobs matching a certain regex and a given IP protocol is 0
** this indicates total loss of connectivity for that job
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]

</pre>

The values 8 and 12 mean:

* 8 = 4 (ip version 4) * probe_success (2 routers are up)
* 12 = 6 (ip version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, the normalisation does not matter, as 0*4 = 0 and 0*6 = 0.
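
Putting the pieces together, a minimal sketch of the alerting rule described above, assuming the blackbox metrics shown earlier (the rule name, duration and the explicit parentheses around the protocol filter are illustrative, not our production rule):

<pre>
groups:
  - name: uplink-monitoring
    rules:
      # Fires when not a single IPv4 uplink probe of a job succeeds anymore.
      # The parentheses make "probe_ip_protocol == 4" act as a filter, so a
      # failed probe (probe_success = 0) still contributes 0 to the sum.
      - alert: UplinkIPv4Down
        expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) (probe_ip_protocol == 4)) by (job) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Total IPv4 connectivity loss for {{ $labels.job }}"
</pre>

As noted above, no normalisation is needed for the == 0 case.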

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~#
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~#
</pre>

Better to also set the author and related options, as sketched below. TOBEFIXED
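
A sketch of what that could look like with @amtool@'s silence options (the author name and duration are made up):

<pre>
# also record who silenced the alert and for how long
amtool silence add --author=nico --duration=2h -c "Ceph is actually fine" alertname=CephHealthSate
</pre>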

h3. Severity levels

The following notions are used:

* critical = panic = calling the whole team
* warning = something needs to be fixed = email to sre, non-paging
* info = not good, might be an indication that something needs fixing, goes to a matrix room

h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see different labels!)
* Regular expressions are not the "default" RE, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated metrics like @up@! (see the sketch below)
** You need to use @relabel_configs@
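
A minimal scrape config sketch illustrating the difference (the job name and target are made up):

<pre>
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['server1.place6.ungleich.ch:9100']
    # Applied BEFORE the scrape; the resulting labels also end up
    # on automatically generated metrics such as "up".
    relabel_configs:
      - source_labels: [__address__]
        regex:        '(.*):9100'
        target_label: 'host'
        replacement:  '$1'
    # Applied AFTER the scrape; only affects scraped series,
    # never the automatic ones like "up".
    metric_relabel_configs:
      - source_labels: [__name__]
        regex:  'node_network_.*'
        action: drop
</pre>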

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set up the "dc" label in case the host is in one of the ungleich places
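
To check the result, the assigned roles can be inspected with a query along these lines (a sketch; the actual output depends on the targets):

<pre>
# count scrape targets per assigned role
promtool query instant http://localhost:9090 'count by (role) (up)'
</pre>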

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.
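
Following the pattern of the example above, a sketch of such a rule (the hostname is made up):

<pre>
      - source_labels: [__address__]
        regex:         'broken1.place6.ungleich.ch.*'
        target_label:  'role'
        replacement:   'down'
</pre>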

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

ecall authorizes based on the email sender address.
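
For illustration, triggering a voice call could look like this (the number is made up):

<pre>
# page a phone number by voice via ecall (hypothetical number)
echo "place5 uplink down" | mail -s "ALERT place5" 41791234567@voice.ecall.ch
</pre>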

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster
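
For reference, a sketch of how alertmanager instances are joined into one cluster (hostnames and ports are assumptions, not our actual setup):

<pre>
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=monitoring.place5.ungleich.ch:9094 \
  --cluster.peer=monitoring.place6.ungleich.ch:9094
</pre>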

h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and comparable tools.

h3. Service/Customer monitoring

* A few blackbox exporter checks can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

h2. Old Monitoring

Before 2020-07, our monitoring incorporated more services and had a different approach.

We used the following technologies/products for monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The servers are still located on the physical machines (red{1..3} and black{1..3}, respectively) and the agents are running on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host is providing (e.g. the exporters):
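
A minimal consul service definition for a node exporter might look like this (a sketch, not our actual config):

<pre>
{
  "service": {
    "name": "node-exporter",
    "port": 9100,
    "tags": ["prometheus"]
  }
}
</pre>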

There is inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html

Consul has some drawbacks (when a node leaves the cluster, it is by default no longer monitored), and the advantage of fully dynamic monitoring is not significant for physical machines of which we already have an inventory.

h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role; all other users were Viewers.
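
In grafana's @ldap.toml@, such a mapping looks roughly like this (the group DN is an assumption):

<pre>
# map the devops LDAP group to Admin, everyone else to Viewer
[[servers.group_mappings]]
group_dn = "cn=devops,ou=groups,dc=ungleich,dc=ch"
org_role = "Admin"

[[servers.group_mappings]]
group_dn = "*"
org_role = "Viewer"
</pre>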

This was retired; the monitoring servers now have static usernames, to be independent of the LDAP infrastructure.