The ungleich monitoring infrastructure » History » Version 26
Nico Schottelius, 08/13/2020 10:56 PM
h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Monitoring Guide

We are using prometheus, grafana, blackbox_exporter and monit for monitoring.

h3. Architecture overview

* There is *1 internal IPv6-only* monitoring system *per place*
** emonitor1.place5.ungleich.ch (real hardware)
** emonitor1.place6.ungleich.ch (real hardware)
** *Main role: alert if services are down*
* There is *1 external dual-stack* monitoring system
** monitoring.place4.ungleich.ch
** *Main role: alert if one or more places are unreachable from outside*
** Also monitors all nodes so that all data is available
* There are *many monitored* systems
* Systems can be marked as intentionally down (but are still monitored)
* Monitoring systems are built with the fewest possible external dependencies

h3. Monitoring and Alerting workflow

* Once per day the SRE team checks the relevant dashboards
** Are systems down that should not be?
** Is there a visible trend of systems failing?
* If the monitoring system sent a notification about a failed system
** The SRE team fixes it the same day if possible
* If the monitoring system sent a critical error message
** Instant fixes are to be applied by the SRE team

h3. Adding a new production system

* Install the correct exporter (often: node_exporter)
* Limit access via nftables

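As a sketch, such an nftables restriction could look like the following. This is a hypothetical example, not our actual ruleset (which is distributed via configuration management); the source prefix uses the IPv6 documentation range as a placeholder for the place-local monitor's network:

<pre>
# Hypothetical example: only the place-local monitor may scrape node_exporter (port 9100)
table inet filter {
        chain input {
                type filter hook input priority 0; policy accept;
                tcp dport 9100 ip6 saddr 2001:db8:1::/64 accept
                tcp dport 9100 drop
        }
}
</pre>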
h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
SUCCESS: 1 rules found

</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that contain a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum by (job) (probe_success)'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#
</pre>

Combining different metrics for filtering: for instance, to filter all metrics of type "probe_success" which also have a metric probe_ip_protocol with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to filter:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum of all jobs matching a certain regex and ip protocol is 0
** this particular job indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]

</pre>

The values 8 and 12 mean:

* 8 = 4 (ip version 4) * 2 (probe_success: 2 routers are up)
* 12 = 6 (ip version 6) * 2 (probe_success: 2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 instances are up, the multiplier does not matter, as 0*4 = 0 and 0*6 = 0.

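A query like this can be wired into an alerting rule. The following is only a sketch; the group and alert names are hypothetical and not taken from our actual rule files:

<pre>
groups:
- name: uplink-example
  rules:
  - alert: UplinkIPv4Down
    # Fires when no IPv4 uplink probe succeeds for a given job
    expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "All IPv4 uplinks down for {{ $labels.job }}"
</pre>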
h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query
</pre>

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~#
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~#
</pre>

Better to use the author flag and co. TOBEFIXED
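As a sketch, a more complete invocation would use amtool's author, duration and comment flags (untested here; consult @amtool silence add --help@ for the exact set):

<pre>
amtool silence add \
  --author="nico" \
  --duration="2h" \
  --comment="Ceph is actually fine" \
  alertname=CephHealthSate
</pre>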

h3. Severity levels

The following notions are used:

* critical = panic = call the whole team
* warning = something needs to be fixed = email to sre, non-paging
* info = not good, might be an indication that something needs fixing; goes to a matrix room

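In alertmanager these levels would typically be expressed as routes matching on a @severity@ label. The following is a hypothetical sketch; the receiver names are made up and not taken from our configuration:

<pre>
route:
  # info and anything unmatched goes to the matrix room
  receiver: matrix-room
  routes:
  - match:
      severity: critical
    receiver: page-team
  - match:
      severity: warning
    receiver: sre-email
</pre>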
h3. Labeling

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

* The @relabel_configs@ are applied BEFORE scraping
* The @metric_relabel_configs@ are applied AFTER scraping (and see different labels!)
* Regular expressions are not the "default" RE, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatic labels like @up@!
** You need to use @relabel_configs@

h3. Setting "roles"

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

<pre>
relabel_configs:
  - source_labels: [__address__]
    regex: '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
    target_label: 'role'
    replacement: '$1'
  - source_labels: [__address__]
    regex: 'ciara.*.ungleich.ch.*'
    target_label: 'role'
    replacement: 'server'
  - source_labels: [__address__]
    regex: '.*:9283'
    target_label: 'role'
    replacement: 'ceph'
  - source_labels: [__address__]
    regex: '((ciara2|ciara4).*)'
    target_label: 'role'
    replacement: 'down'
  - source_labels: [__address__]
    regex: '.*(place.*).ungleich.ch.*'
    target_label: 'dc'
    replacement: '$1'
</pre>

What happens here:

* @__address__@ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
* We apply some roles by default (server, monitor etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* At the end we set up the "dc" label in case the host is in an ungleich place

h3. Marking hosts down

If a host or service is intentionally down, **change its role** to **down**.

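Alerting expressions can then exclude intentionally-down hosts by filtering on that label; a sketch:

<pre>
# Only alert on instances that are not intentionally marked down
up{role!="down"} == 0
</pre>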
h3. SMS and Voice notifications

We use https://ecall.ch.

* For voice: mail to number@voice.ecall.ch
* For SMS: mail to number@sms.ecall.ch

Authorization is based on the email sender.

h3. Alertmanager clusters

* The outside monitors form one alertmanager cluster
* The inside monitors form one alertmanager cluster

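Such a cluster is formed by pointing the alertmanager instances at each other via the @--cluster.peer@ flag; a sketch for the inside cluster, assuming the default cluster port 9094 (the actual invocation is managed by our configuration management):

<pre>
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.peer=emonitor1.place5.ungleich.ch:9094 \
  --cluster.peer=emonitor1.place6.ungleich.ch:9094
</pre>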
h3. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist. This is very similar to supervise and co.

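A minimal monit check, as a sketch only (the service name and paths are hypothetical; the real definitions come from the @__ungleich_monit@ cdist type):

<pre>
# Restart the daemon when its process disappears
check process prometheus with pidfile /var/run/prometheus.pid
    start program = "/etc/init.d/prometheus start"
    stop program  = "/etc/init.d/prometheus stop"
</pre>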
h3. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There is a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet; partially manual.

h2. Old Monitoring

Before 2020-07 our monitoring incorporated more services and had a different approach.

We used the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and grafana were located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

The monitoring machines above have been replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.

315 | |||
316 | h3. Consul |
||
317 | |||
318 | We used a consul cluster for each datacenter (e.g. place5 and place6). |
||
319 | The servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents are running on all other monitored machines (such as servers and VMs) |
||
320 | |||
321 | consul is configured to publish the service its host is providing (e.g. the exporters) |
||
322 | |||
323 | There is a inter-datacenter communication (wan gossip) [https://www.consul.io/docs/guides/datacenters.html] |
||
324 | |||
325 | Consul has some drawbacks (nodes leaving the cluster -> node by default not monitored anymore) and the advantage of fully dynamic monitoring is not a big advantage for physical machines of which we already have an inventory. |
||
326 | |||
h3. Authentication

The grafana authentication worked over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired; the monitoring servers now have static usernames to be independent of the LDAP infrastructure.