The ungleich monitoring infrastructure » History » Version 33
Dominique Roux, 05/07/2021 08:38 AM
h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6-only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual-stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
** Also monitors all nodes, so that all data is available in one place
* There is *1 customer-enabled* monitoring system
** monitoring-v3.ungleich.ch
** Uses LDAP
** Runs on a VM
* There are *many monitored* systems
* Systems can be marked as intentionally down (but are still monitored)
* Monitoring systems are built with the least possible amount of external dependencies

h3. Monitoring and Alerting workflow

* Once per day the SRE team checks the relevant dashboards
** Are systems down that should not be?
** Is there a visible trend of systems failing?
* If the monitoring system sent a notification about a failed system
** The SRE team fixes it the same day, if possible
* If the monitoring system sent a critical error message
** Instant fixes are to be applied by the SRE team

h3. Adding a new production system

* Install the correct exporter (often: node_exporter)
* Limit access via nftables
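
A minimal sketch of the nftables step, assuming node_exporter listens on its default port 9100; the allowed source prefix below is a placeholder (documentation prefix), not our actual monitoring network:

<pre>
# allow scraping of node_exporter only from the monitoring systems
nft add rule inet filter input ip6 saddr 2001:db8::/32 tcp dport 9100 accept
nft add rule inet filter input tcp dport 9100 drop
</pre>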

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
SUCCESS: 1 rules found
</pre>
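
The file being checked has the usual prometheus shape; a minimal sketch of what @/etc/prometheus/prometheus.yml@ might contain (the paths and job names here are illustrative, not our actual configuration):

<pre>
global:
  scrape_interval: 1m

rule_files:
  - /etc/prometheus/*.rules

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['server1.place11.ungleich.ch:9100']
</pre>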

h3. Configuring emonitors

<pre>
cdist config -bj7 -p3 -vv emonitor1.place{5,6,7}.ungleich.ch
</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum over all metrics that contain a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum by (job) (probe_success)'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#
</pre>

Combining different metrics for filtering, for instance to select all metrics of type "probe_success" which also have a metric probe_ip_protocol with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to match the two metrics on a common label:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* If the sum over all jobs matching a certain regex, matched on IP protocol, is 0
** this particular job indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * probe_success (2 routers are up)
* 12 = 6 (IP version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 probes are up, normalisation does not matter, as 0*4 = 0 and 0*6 = 0.
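
Putting the pieces together, such an expression can be turned into an alerting rule. A sketch, assuming the 5m window from the monitoring rules below; the alert name and severity label are illustrative, not our actual rule files:

<pre>
groups:
  - name: uplink-monitoring
    rules:
      - alert: UplinkIPv4TotalLoss
        expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "All IPv4 uplink probes of {{ $labels.job }} are down"
</pre>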

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs the URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~#
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~#
</pre>

TOBEFIXED: silences should be created with a proper author and related metadata.
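
A sketch of what the TOBEFIXED note above suggests: @amtool silence add@ accepts @--author@ and @--duration@ flags, so a silence can carry proper metadata (the author name and duration below are placeholders):

<pre>
amtool silence add \
    --author="nico" \
    --duration="2h" \
    --comment="Ceph is actually fine" \
    alertname=CephHealthSate
</pre>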

h3. Severity levels

The following notions are used:

* critical = panic = calling the whole team
* warning = something needs to be fixed = email to the SRE team, non-paging
* info = not good, might be an indication to fix something; goes to a matrix room
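
These levels can be mapped to alertmanager routing; a minimal sketch, where the receiver names are assumptions and not our actual configuration:

<pre>
route:
  receiver: matrix-room      # default: info and anything unmatched
  routes:
    - match:
        severity: critical
      receiver: phone-page   # pages the whole team
    - match:
        severity: warning
      receiver: sre-email    # non-paging email to the SRE team
</pre>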

h3. Labeling

Labeling in Prometheus is a science on its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see a different set of labels!)
* Regular expressions are not the "default" RE syntax, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated metrics like @up@!
** You need to use @relabel_configs@ for those

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
relabel_configs:
  - source_labels: [__address__]
    regex: '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
    target_label: 'role'
    replacement: '$1'
  - source_labels: [__address__]
    regex: 'ciara.*.ungleich.ch.*'
    target_label: 'role'
    replacement: 'server'
  - source_labels: [__address__]
    regex: '.*:9283'
    target_label: 'role'
    replacement: 'ceph'
  - source_labels: [__address__]
    regex: '((ciara2|ciara4).*)'
    target_label: 'role'
    replacement: 'down'
  - source_labels: [__address__]
    regex: '.*(place.*).ungleich.ch.*'
    target_label: 'dc'
    replacement: '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor, etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set the "dc" label in case the host is in a place of ungleich

h3. Marking hosts down

If a host or service is intentionally down, *change its role* to *down*.

h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

Authorization is based on the email sender address.
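
In alertmanager this can be wired up as a plain email receiver; a sketch, where the number and sender address are placeholders:

<pre>
receivers:
  - name: sms-oncall
    email_configs:
      - to: '0041791234567@sms.ecall.ch'
        from: 'monitoring@example.ch'
</pre>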

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster

h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.

h3. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There is a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. There is no alertmanager yet; it is partially manual.

h2. Monitoring Rules

The following is a description of the logical rules that (are, need to be, should be) in place.

h3. External Monitoring/Alerting

To be able to catch multiple uplink errors, there should be 2 external prometheus systems operating in a cluster for alerting (alertmanagers).
The retention period of these monitoring servers can be low, as their main purpose is link-down detection. No internal services need to be monitored.

h3. External Uplink monitoring (IPv6, IPv4)

* There should be 2 external systems that monitor the two routers per place via ping
** Whether IPv4 and IPv6 are handled by the same systems does not matter
** However there need to be
*** 2 for IPv4 (place4, place7)
*** 2 for IPv6 (place4, ?)
* If all uplinks of one place are down for at least 5m, we send out an emergency alert

h3. External DNS monitoring (IPv6, IPv4)

* There should be 2 external systems that monitor whether our authoritative DNS servers are working
* We query whether ipv4.ungleich.ch resolves to an IPv4 address
* We query whether ipv6.ungleich.ch resolves to an IPv6 address
* If all external servers fail for 5m, we send out an emergency alert
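
With blackbox_exporter, such a check can be expressed as a DNS module; a minimal sketch (the module name is an assumption, not our actual configuration):

<pre>
modules:
  dns_ipv4_ungleich:
    prober: dns
    dns:
      query_name: "ipv4.ungleich.ch"
      query_type: "A"
</pre>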

h3. Internal ceph monitors

* Monitor whether there is a quorum
** If there is no quorum for at least 15m, we send out an emergency alert
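
As a sketch, quorum loss can be detected from the @ceph_mon_quorum_status@ metric exposed by the ceph-mgr prometheus module; the alert name and threshold below are illustrative (the threshold depends on the number of monitors):

<pre>
groups:
  - name: ceph-alerts
    rules:
      - alert: CephNoQuorum
        expr: sum(ceph_mon_quorum_status) < 2
        for: 15m
        labels:
          severity: critical
</pre>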

h3. Monitoring monitoring

* The internal monitors monitor whether the external monitors are reachable

h2. Typical tasks

h3. Adding customer monitoring

Customers can have their own alerts. By default, if customer resources are monitored, we ...

* If we do not have access to the VM: ask the user to set up the prometheus node exporter and to whitelist port 9100 so that it is accessible from 2a0a:e5c0:2:2:0:c8ff:fe68:bf3b
* Otherwise, do the above step ourselves
* Ensure the customer has an LDAP account
** Ask the user to log in with their LDAP user on https://monitoring-v3.ungleich.ch/ - this way grafana knows about the user (similar to redmine)
* Create a folder on the grafana of https://monitoring-v3.ungleich.ch/ with the same name as the LDAP user (for instance "nicocustomer")
* Modify the permissions of the folder
** Remove the standard Viewer role
** Add User -> the LDAP user -> View

Set up a dashboard. If it allows selecting nodes:

* Limit the variable by defining the regex in the dashboard settings

If the user requested alerts:

* Configure them in cdist, type __dcl_monitoring_server2020/files/prometheus-v3/

Finally:

<pre>
cdist config -v monitoring-v3.ungleich.ch
</pre>

h2. Old Monitoring

Before 2020-07 our monitoring incorporated more services / had a different approach.

We used the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana were located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

h3. Consul

We used a consul cluster for each datacenter (e.g. place5 and place6).
The servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents were running on all other monitored machines (such as servers and VMs).

Consul was configured to publish the services its host was providing (e.g. the exporters).

There was inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html

Consul has some drawbacks (nodes leaving the cluster are, by default, no longer monitored), and the advantage of fully dynamic monitoring is not a big one for physical machines of which we already have an inventory.

h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role; all other users were Viewers.

This was retired; monitoring servers now have static usernames, to be independent of the LDAP infrastructure.