h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that share a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)
'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus# 

</pre>

Combining different metrics for filtering. For instance, to filter all metrics of type "probe_success" which also have a metric probe_ip_protocol with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to filter:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum of all jobs of a certain regex and match on ip protocol is 0
** this particular job indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]

</pre>

The values 8 and 12 mean:

* 8 = 4 (ip version 4) * probe_success (2 routers are up)
* 12 = 6 (ip version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, it does not matter, as 0*4 = 0 and 0*6 = 0.
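
Such an alert could be written as a rule roughly like the sketch below (alert name, @for@ duration and severity are assumptions, not the deployed rule file). Note that it uses @and@ instead of @*@, so that failing probes keep their 0 value and the per-job sum really becomes 0 on total loss of connectivity:

<pre>
groups:
  - name: uplink-monitoring
    rules:
      - alert: UplinkIpv4Down
        # keep probe_success (0 or 1) for every instance probed over IPv4, then sum per job;
        # the sum is 0 exactly when no IPv4 probe of that job succeeds anymore
        expr: sum by (job) (probe_success{job=~"uplink-.*"} and on(instance) probe_ip_protocol == 4) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "All IPv4 uplinks of this job are down"
</pre>
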
h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary                                                               
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down                                 
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down                                 
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down                          
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down  
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down                           
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.                                          
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032                       
[14:54:41] monitoring.place6:~# 
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment                
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine  
[15:00:13] monitoring.place6:~# 
</pre>

Better: also set the author and duration when adding silences, as sketched below. TOBEFIXED
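
A sketch of the same silence with author and duration set explicitly (author name and duration are made up):

<pre>
# add --alertmanager.url=http://localhost:9093 on Alpine, as above
amtool silence add --author="nico" --duration="2h" \
  --comment="Ceph is actually fine" alertname=CephHealthSate
</pre>
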
h3. Severity levels

The following notions are used (see the routing sketch below):

* critical = panic = call the whole team
* warning = something needs to be fixed = email to sre, non-paging
* info = not good, might be an indication that something needs fixing; goes to a matrix room
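
As a sketch, these levels could be mapped to receivers in the alertmanager route tree like this (receiver names are assumptions, not the deployed configuration):

<pre>
route:
  receiver: matrix-room       # default: info and everything unmatched
  routes:
    - match:
        severity: critical
      receiver: sms-voice     # pages the whole team
    - match:
        severity: warning
      receiver: sre-email     # non-paging email to sre
</pre>
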
h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see a different set of labels!)
* Regular expressions are not the "default" RE syntax, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated series like @up@ !
** You need to use @relabel_configs@ instead (see the sketch below)
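
A minimal sketch contrasting the two (the job name, target and drop rule are only illustrative):

<pre>
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['server1.place5.ungleich.ch:9100']
    relabel_configs:           # applied BEFORE the scrape: only target labels like __address__ exist
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
    metric_relabel_configs:    # applied AFTER the scrape: sees scraped series, but not automatic ones like up
      - source_labels: [__name__]
        regex:         'node_network_.*'
        action:        drop
</pre>
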
h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* Special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set up the "dc" label in case the host is in one of the ungleich places

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down** (see the sketch below).
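
For example, with a hypothetical host, this is an additional relabel rule in the style of the ciara2/ciara4 rule above:

<pre>
      # server47 is a made-up example host
      - source_labels: [__address__]
        regex:         'server47.place6.ungleich.ch.*'
        target_label:  'role'
        replacement:   'down'
</pre>
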
h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

Uses email-sender-based authorization (see the sketch below).
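
A sketch of an alertmanager receiver using this (phone number, sender address and smarthost are placeholders):

<pre>
receivers:
  - name: sms-oncall
    email_configs:
      - to: '0791234567@sms.ecall.ch'    # placeholder number
        from: 'monitoring@ungleich.ch'   # must be an authorized sender
        smarthost: 'localhost:25'
        require_tls: false
</pre>
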
h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster (see the sketch below)
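
A sketch of how the members of such a cluster are started; every alertmanager lists its peers (hostnames and ports are illustrative):

<pre>
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=monitoring.place5.ungleich.ch:9094 \
  --cluster.peer=monitoring.place6.ungleich.ch:9094
</pre>
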
h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.
h3. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

h2. Old Monitoring

Before 2020-07, our monitoring incorporated more services and had a different approach:

We used the following technologies / products for monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and Grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The consul servers are still located on the physical machines (red{1..3} and black{1..3}, respectively) and the agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html

Consul has some drawbacks (nodes leaving the cluster are by default no longer monitored), and the advantage of fully dynamic monitoring is not a big advantage for physical machines of which we already have an inventory.

h3. Authentication

The Grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired; the monitoring servers now have static usernames to be independent of the LDAP infrastructure.