The ungleich monitoring infrastructure » History » Version 26
Nico Schottelius, 08/13/2020 10:56 PM
h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6-only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual-stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
** Also monitors all nodes so that all data is available
* There are *many monitored* systems
* Systems can be marked as intentionally down (but are still monitored)
* Monitoring systems are built with the fewest possible external dependencies

h3. Monitoring and Alerting workflow

* Once per day the SRE team checks the relevant dashboards
** Are systems down that should not be?
** Is there a visible trend of systems failing?
* If the monitoring system sent a notification about a failed system
** The SRE team fixes it the same day if possible
* If the monitoring system sent a critical error message
** Instant fixes are to be applied by the SRE team

h3. Adding a new production system

* Install the correct exporter (often: node_exporter)
* Limit access via nftables

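As a sketch, such an nftables restriction could look like the following. This is a hypothetical example, not our actual ruleset (which is distributed via configuration management); the source prefix uses the IPv6 documentation range as a placeholder for the place-local monitor's network:

<pre>
# Hypothetical example: only the place-local monitor may scrape node_exporter (port 9100)
table inet filter {
        chain input {
                type filter hook input priority 0; policy accept;
                tcp dport 9100 ip6 saddr 2001:db8:1::/64 accept
                tcp dport 9100 drop
        }
}
</pre>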
h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
SUCCESS: 1 rules found

</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that contain a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum by (job) (probe_success)'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#
</pre>

Combining different metrics for filtering: for instance, to filter all metrics of type "probe_success" which also have a metric probe_ip_protocol with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to filter:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum of all jobs matching a certain regex and ip protocol is 0
** this particular job indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]

</pre>

The values 8 and 12 mean:

* 8 = 4 (ip version 4) * 2 (probe_success: 2 routers are up)
* 12 = 6 (ip version 6) * 2 (probe_success: 2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 instances are up, the multiplier does not matter, as 0*4 = 0 and 0*6 = 0.

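A query like this can be wired into an alerting rule. The following is only a sketch; the group and alert names are hypothetical and not taken from our actual rule files:

<pre>
groups:
- name: uplink-example
  rules:
  - alert: UplinkIPv4Down
    # Fires when no IPv4 uplink probe succeeds for a given job
    expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "All IPv4 uplinks down for {{ $labels.job }}"
</pre>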
h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~#
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~#
</pre>

Better to use the author flag and co. TOBEFIXED
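As a sketch, a more complete invocation would use amtool's author, duration and comment flags (untested here; consult @amtool silence add --help@ for the exact set):

<pre>
amtool silence add \
  --author="nico" \
  --duration="2h" \
  --comment="Ceph is actually fine" \
  alertname=CephHealthSate
</pre>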

h3. Severity levels

The following notions are used:

* critical = panic = call the whole team
* warning = something needs to be fixed = email to sre, non-paging
* info = not good, might be an indication that something needs fixing; goes to a matrix room

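In alertmanager these levels would typically be expressed as routes matching on a @severity@ label. The following is a hypothetical sketch; the receiver names are made up and not taken from our configuration:

<pre>
route:
  # info and anything unmatched goes to the matrix room
  receiver: matrix-room
  routes:
  - match:
      severity: critical
    receiver: page-team
  - match:
      severity: warning
    receiver: sre-email
</pre>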
h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see different labels!)
* Regular expressions are not the "default" RE, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatic labels like @up@!
** You need to use @relabel_configs@

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
relabel_configs:
  - source_labels: [__address__]
    regex: '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
    target_label: 'role'
    replacement: '$1'
  - source_labels: [__address__]
    regex: 'ciara.*.ungleich.ch.*'
    target_label: 'role'
    replacement: 'server'
  - source_labels: [__address__]
    regex: '.*:9283'
    target_label: 'role'
    replacement: 'ceph'
  - source_labels: [__address__]
    regex: '((ciara2|ciara4).*)'
    target_label: 'role'
    replacement: 'down'
  - source_labels: [__address__]
    regex: '.*(place.*).ungleich.ch.*'
    target_label: 'dc'
    replacement: '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set up the "dc" label in case the host is in an ungleich place

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.

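Alerting expressions can then exclude intentionally-down hosts by filtering on that label; a sketch:

<pre>
# Only alert on instances that are not intentionally marked down
up{role!="down"} == 0
</pre>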
h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

Authorization is based on the email sender.

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster

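Such a cluster is formed by pointing the alertmanager instances at each other via the @--cluster.peer@ flag; a sketch for the inside cluster, assuming the default cluster port 9094 (the actual invocation is managed by our configuration management):

<pre>
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.peer=emonitor1.place5.ungleich.ch:9094 \
  --cluster.peer=emonitor1.place6.ungleich.ch:9094
</pre>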
h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.

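A minimal monit check, as a sketch only (the service name and paths are hypothetical; the real definitions come from the @__ungleich_monit@ cdist type):

<pre>
# Restart the daemon when its process disappears
check process prometheus with pidfile /var/run/prometheus.pid
    start program = "/etc/init.d/prometheus start"
    stop program  = "/etc/init.d/prometheus stop"
</pre>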
h3. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There is a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet; partially manual.

h2. Old Monitoring

Before 2020-07 our monitoring incorporated more services and had a different approach.

We used the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana were located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above have been replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

315 | |||
316 | h3. Consul |
||
317 | |||
318 | We used a consul cluster for each datacenter (e.g. place5 and place6). |
||
319 | The servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents are running on all other monitored machines (such as servers and VMs) |
||
320 | |||
321 | consul is configured to publish the service its host is providing (e.g. the exporters) |
||
322 | |||
323 | There is a inter-datacenter communication (wan gossip) [https://www.consul.io/docs/guides/datacenters.html] |
||
324 | |||
325 | Consul has some drawbacks (nodes leaving the cluster -> node by default not monitored anymore) and the advantage of fully dynamic monitoring is not a big advantage for physical machines of which we already have an inventory. |
||
326 | |||
h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired; the monitoring servers now have static usernames to be independent of the LDAP infrastructure.