h1. The ungleich monitoring infrastructure 

 {{>toc}} 

 h2. Monitoring Guide 

 We are using prometheus, grafana, blackbox_exporter and monit for monitoring. 

 h3. Architecture v2 overview 

 * There is *1 internal IPv6 only* monitoring system *per place* 
 ** emonitor1.place5.ungleich.ch (real hardware) 
 ** emonitor1.place6.ungleich.ch (real hardware) 
 ** *Main role: alert if services are down* 
 * There is *1 external dual stack* monitoring system 
 ** monitoring.place4.ungleich.ch 
 ** *Main role: alert if one or more places are unreachable from outside* 
** Also monitors all nodes so that all data is available
 * There is *1 customer enabled* monitoring system 
 ** monitoring-v3.ungleich.ch 
 ** Uses LDAP 
 ** Runs on a VM 
 * There are *many monitored* systems 
 * Systems can be marked as intentionally down (but still kept monitored) 
 * Monitoring systems are built with the least amount of external dependencies 

 h3. Monitoring and Alerting workflow 

 * Once per day the SRE team checks the relevant dashboards 
 ** Are systems down that should not be? 
 ** Is there a trend visible of systems failing? 
* If the monitoring system sends a notification about a failed system
** The SRE team fixes it the same day, if possible
* If the monitoring system sends a critical error message
** Instant fixes are to be applied by the SRE team

 h3. Adding a new production system 

 * Install the correct exporter (often: node_exporter) 
 * Limit access via nftables 
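
A minimal nftables sketch for the second point, assuming node_exporter listens on its default port 9100; the source prefix below is a placeholder for the monitoring network, not the real rule set:

<pre>
# Sketch only: allow the monitoring hosts to scrape node_exporter (:9100),
# drop everyone else. The source prefix is a placeholder.
table inet monitoring {
    chain input {
        type filter hook input priority 0; policy accept;
        ip6 saddr 2a0a:e5c0::/29 tcp dport 9100 accept
        tcp dport 9100 drop
    }
}
</pre>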

 h3. Configuring prometheus 

 Use @promtool check config@ to verify the configuration. 

 <pre> 
 [21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml  
 Checking /etc/prometheus/prometheus.yml 
   SUCCESS: 4 rule files found 

 Checking /etc/prometheus/blackbox.rules 
   SUCCESS: 3 rules found 

 Checking /etc/prometheus/ceph-alerts.rules 
   SUCCESS: 8 rules found 

 Checking /etc/prometheus/node-alerts.rules 
   SUCCESS: 8 rules found 

 Checking /etc/prometheus/uplink-monitoring.rules 
   SUCCESS: 1 rules found 

 </pre> 

 

 h3. Configuring emonitors 

 <pre> 
 cdist config -bj7 -p3 -vv emonitor1.place{5,6,7}.ungleich.ch 
 </pre> 

 

 h3. Querying prometheus 

 Use @promtool query instant@ to query values: 

 <pre> 
 [21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1' 
 probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577] 
 probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577] 
 probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577] 
 probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577] 
 </pre> 

 Typical queries: 

Creating a sum of all metrics that contain a common label, for instance summing over all jobs:

 <pre> 
 sum by (job) (probe_success) 

 [17:07:58] server1.place11:/etc/prometheus# promtool    query instant http://localhost:9090 'sum by (job) (probe_success) 
 ' 
 {job="routers-place5"} => 4 @[1593961699.969] 
 {job="uplink-place5"} => 4 @[1593961699.969] 
 {job="routers-place6'"} => 4 @[1593961699.969] 
 {job="uplink-place6"} => 4 @[1593961699.969] 
 {job="core-services"} => 3 @[1593961699.969] 
 [17:08:19] server1.place11:/etc/prometheus#  

 </pre> 


Combining different metrics for filtering: for instance, to select all metrics of type "probe_success" which also have a corresponding metric @probe_ip_protocol@ with value = 4

 * probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619] 

The @on@ operator restricts the vector matching to a common label:

 <pre> 
 sum(probe_success * on(instance) probe_ip_protocol == 4) 
 </pre> 


 Creating an alert: 

* If the sum over all jobs matching a certain regex, combined with a match on the IP protocol, is 0
** then this particular job indicates a total loss of connectivity
 * We want to get a vector like this: 
 ** job="routers-place5", protocol = 4  
 ** job="uplink-place5", protocol = 4  
 ** job="routers-place5", protocol = 6  
 ** job="uplink-place5", protocol = 6 


 Query for IPv4 of all routers: 

 <pre> 
 [17:09:26] server1.place11:/etc/prometheus# promtool    query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)' 
 {job="routers-place5"} => 8 @[1593963562.281] 
 {job="routers-place6'"} => 8 @[1593963562.281] 
 </pre> 

Query for IPv6 of all routers:

 <pre> 
 [17:39:22] server1.place11:/etc/prometheus# promtool    query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)' 
 {job="routers-place5"} => 12 @[1593963626.483] 
 {job="routers-place6'"} => 12 @[1593963626.483] 
 [17:40:26] server1.place11:/etc/prometheus#  
 </pre> 

 Query for all IPv6 uplinks: 

 <pre> 
 [17:40:26] server1.place11:/etc/prometheus# promtool    query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)' 
 {job="uplink-place5"} => 12 @[1593963675.835] 
 {job="uplink-place6"} => 12 @[1593963675.835] 
 [17:41:15] server1.place11:/etc/prometheus#  
 </pre> 


 Query for all IPv4 uplinks: 

 <pre> 
 [17:41:15] server1.place11:/etc/prometheus# promtool    query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)' 
 {job="uplink-place5"} => 8 @[1593963698.108] 
 {job="uplink-place6"} => 8 @[1593963698.108] 

 </pre> 

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * probe_success (2 routers are up)
* 12 = 6 (IP version 6) * probe_success (2 routers are up)

 To normalise, we would need to divide by 4 (or 6): 

 <pre> 
 [17:41:38] server1.place11:/etc/prometheus# promtool    query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4' 
 {job="uplink-place5"} => 2 @[1593963778.885] 
 {job="uplink-place6"} => 2 @[1593963778.885] 
 [17:42:58] server1.place11:/etc/prometheus# promtool    query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6' 
 {job="uplink-place5"} => 2 @[1593963788.276] 
 {job="uplink-place6"} => 2 @[1593963788.276] 
 </pre> 

However, if we are only interested in whether zero probes are up, normalisation does not matter, as 0*4 = 0 and 0*6 = 0.

 h3. Using Grafana 

 * Username for changing items: "admin" 
 * Username for viewing dashboards: "ungleich" 
 * Passwords in the password store 

 h3. Managing alerts 

 * Read https://prometheus.io/docs/practices/alerting/ as an introduction 
 * Use @amtool@ 

 Showing current alerts: 

 <pre> 
 # Alpine needs URL (why?) 
 amtool alert query --alertmanager.url=http://localhost:9093 

 # Debian 
 amtool alert query 
 </pre> 


 <pre> 
 [14:54:35] monitoring.place6:~# amtool alert query 
 Alertname              Starts At                   Summary                                                                
 InstanceDown           2020-07-01 10:24:03 CEST    Instance red1.place5.ungleich.ch down                                  
 InstanceDown           2020-07-01 10:24:03 CEST    Instance red3.place5.ungleich.ch down                                  
 InstanceDown           2020-07-05 12:51:03 CEST    Instance apu-router2.place5.ungleich.ch down                           
 UngleichServiceDown    2020-07-05 13:51:19 CEST    Ungleich internal service https://staging.swiss-crowdfunder.com down   
 InstanceDown           2020-07-05 13:55:33 CEST    Instance https://swiss-crowdfunder.com down                            
 CephHealthSate         2020-07-05 13:59:49 CEST    Ceph Cluster is not healthy.                                           
 LinthalHigh            2020-07-05 14:01:41 CEST    Temperature on risinghf-19 is 32.10012512207032                        
 [14:54:41] monitoring.place6:~#  
 </pre> 

 Silencing alerts: 

 <pre> 
 [14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate 
 4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa 
 [15:00:06] monitoring.place6:~# amtool silence query 
 ID                                      Matchers                    Ends At                    Created By    Comment                 
 4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa    alertname=CephHealthSate    2020-07-05 14:00:06 UTC    root          Ceph is actually fine   
 [15:00:13] monitoring.place6:~#  
 </pre> 

It is better to also record the author and related metadata when silencing. TOBEFIXED
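
For example (the flags are standard @amtool@ flags, the author value is a placeholder):

<pre>
# silence with an explicit author, comment and expiry
# (on Alpine, add --alertmanager.url=http://localhost:9093 as above)
amtool silence add --author="sre-oncall" --duration="2h" \
  --comment="Ceph is actually fine" alertname=CephHealthSate
</pre>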

 h3. Severity levels 

 The following notions are used: 

* critical = panic = call the whole team
* warning = something needs to be fixed = email to SRE, non-paging
* info = not good, might be an indication that something needs fixing; goes to a Matrix room
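
For illustration, an alerting rule carrying such a @severity@ label might look like the following sketch (the rule body is an example, not a copy of our production rules):

<pre>
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning        # email to SRE, non-paging
        annotations:
          summary: "Instance {{ $labels.instance }} down"
</pre>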

 h3. Labeling 

 Labeling in Prometheus is a science on its own and has a lot of pitfalls. Let's start with some: 

 * The @relabel_configs@ are applied BEFORE scraping 
* The @metric_relabel_configs@ are applied AFTER scraping (and thus see a different set of labels!)
* Regular expressions are not the "default" RE syntax, but "RE2":https://github.com/google/re2/wiki/Syntax
* @metric_relabel_configs@ does not apply to automatically generated series like @up@!
** You need to use @relabel_configs@ for those
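
A short sketch illustrating the difference; the job name and regexes are examples, not our production configuration:

<pre>
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['server1.place5.ungleich.ch:9100']
    # relabel_configs run BEFORE scraping and see target labels such as __address__;
    # labels set here also end up on automatically generated series like up.
    relabel_configs:
      - source_labels: [__address__]
        regex: '.*(place.*).ungleich.ch.*'
        target_label: 'dc'
        replacement: '$1'
    # metric_relabel_configs run AFTER scraping and see the scraped samples,
    # but never the automatically generated series (up, scrape_duration_seconds, ...).
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_scrape_collector_.*'
        action: drop
</pre>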

 h3. Setting "roles" 

 We use the label "role" to define a primary purpose per host. Example from 2020-07-07: 

 <pre> 
     relabel_configs: 
       - source_labels: [__address__] 
         regex:           '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*' 
         target_label:    'role' 
         replacement:     '$1' 
       - source_labels: [__address__] 
         regex:           'ciara.*.ungleich.ch.*' 
         target_label:    'role' 
         replacement:     'server' 
       - source_labels: [__address__] 
         regex:           '.*:9283' 
         target_label:    'role' 
         replacement:     'ceph' 
       - source_labels: [__address__] 
         regex:           '((ciara2|ciara4).*)' 
         target_label:    'role' 
         replacement:     'down' 
       - source_labels: [__address__] 
         regex:           '.*(place.*).ungleich.ch.*' 
         target_label:    'dc' 
         replacement:     '$1' 
 </pre> 

 What happens here: 

* @__address__@ contains the hostname and port, e.g. server1.placeX.ungleich.ch:9100
* We assign some roles by default (server, monitor, etc.)
* There is a special rule for ciara, which does not match the serverX pattern
* ciara2 and ciara4 in the above example are intentionally down
* Finally, we set the "dc" label if the host is located in one of the ungleich places

 h3. Marking hosts down 

 If a host or service is intentionally down, **change its role** to **down**. 
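
Alert expressions can then exclude intentionally down hosts, for instance with a query like this (illustrative only):

<pre>
# only consider hosts that are not intentionally down
up{role!="down"} == 0
</pre>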

 h3. SMS and Voice notifications 

 We use https://ecall.ch. 

 * For voice: mail to number@voice.ecall.ch 
* For SMS: mail to number@sms.ecall.ch

Authorisation is based on the email sender address.
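
A hedged sketch of a matching alertmanager receiver; the receiver name, phone number, sender address and smarthost are placeholders:

<pre>
receivers:
  - name: 'sre-sms'
    email_configs:
      - to: '0041791234567@sms.ecall.ch'     # placeholder number
        from: 'alertmanager@ungleich.ch'     # sender must be authorised at ecall.ch
        smarthost: 'smtp.example.com:587'    # placeholder smarthost
</pre>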

 h3. Alertmanager clusters 

 * The outside monitors form one alertmanager cluster 
 * The inside monitors form one alertmanager cluster 
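
Clustering is done via alertmanager's gossip protocol. A sketch for one of the inside monitors (the flags are standard alertmanager flags, the listen address is an example):

<pre>
# on emonitor1.place5.ungleich.ch, peering with place6
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address="[::]:9094" \
  --cluster.peer=emonitor1.place6.ungleich.ch:9094
</pre>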

 h3. Monit 

 We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See `__ungleich_monit` type in dot-cdist. This is very similar to supervise and co. 

 h3. Service/Customer monitoring 

 * A few blackbox things can be found on the datacenter monitoring infrastructure. 
There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

 h2. Monitoring Rules 

 The following is a description of logical rules that (are, need to be, should be) in place. 

 h3. External Monitoring/Alerting 

 To be able to catch multiple uplink errors, there should be 2 external prometheus systems operating in a cluster for alerting (alertmanagers). 
 The retention period of these monitoring servers can be low, as their main purpose is link down detection. No internal services need to be monitored. 
 
 h3. External Uplink monitoring (IPv6, IPv4) 

* There should be 2 external systems that ping the two routers of each place
** Whether IPv4 and IPv6 are handled by the same systems does not matter
** However, there need to be:
*** 2 for IPv4 (place4, place7)
*** 2 for IPv6 (place4, ?)
 * If all uplinks of one place are down for at least 5m, we send out an emergency alert 
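
A sketch of such an alert, based on the queries in the monitoring guide above (the alert name and labels are assumptions, and @and@ is used instead of @*@ so that failed probes keep their zero value and the sum really becomes 0):

<pre>
groups:
  - name: uplink-monitoring
    rules:
      - alert: UplinkIPv4Down
        # sum of successful IPv4 uplink probes per place; 0 = total loss of IPv4 connectivity
        expr: sum by (job) (probe_success{job=~"uplink-.*"} and on(instance) (probe_ip_protocol == 4)) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "All IPv4 uplinks of {{ $labels.job }} are down"
</pre>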

 h3. External DNS monitoring (IPv6, IPv4) 

 * There should be 2 external systems that monitor whether our authoritative DNS servers are working 
 * We query whether ipv4.ungleich.ch resolves to an IPv4 address 
 * We query whether ipv6.ungleich.ch resolves to an IPv6 address 
 * If all external servers fail for 5m, we send out an emergency alert 
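
This can be implemented with the blackbox_exporter DNS prober. A minimal sketch with hypothetical module names (the probe targets would be our authoritative DNS servers):

<pre>
modules:
  dns_ipv4_ungleich:            # hypothetical module name
    prober: dns
    dns:
      query_name: "ipv4.ungleich.ch"
      query_type: "A"
  dns_ipv6_ungleich:            # hypothetical module name
    prober: dns
    dns:
      query_name: "ipv6.ungleich.ch"
      query_type: "AAAA"
</pre>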

 h3. Internal ceph monitors 

* Monitor whether there is a quorum
 ** If there is no quorum for at least 15m, we send out an emergency alert 
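
A hedged sketch of such a rule, assuming the ceph mgr prometheus module is scraped: @ceph_mon_quorum_status@ is 1 for every monitor in quorum, and the threshold below assumes 3 monitors.

<pre>
- alert: CephMonQuorumAtRisk
  # fewer than 2 of 3 monitors in quorum -> quorum lost (threshold is an assumption)
  expr: sum(ceph_mon_quorum_status) < 2
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Ceph monitor quorum is lost or at risk"
</pre>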

 h3. Monitoring monitoring 

 * The internal monitors monitor whether the external monitors are reachable  
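
A sketch of how an internal monitor can probe an external monitor via the blackbox exporter (the job name and the icmp module are assumptions):

<pre>
scrape_configs:
  - job_name: 'monitoring-monitoring'           # hypothetical job name
    metrics_path: /probe
    params:
      module: [icmp]                             # assumes an icmp module in blackbox.yml
    static_configs:
      - targets: ['monitoring.place4.ungleich.ch']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115              # local blackbox_exporter
</pre>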

 h2. Typical tasks 

 h3. Adding customer monitoring 

Customers can have their own alerts. By default, if customer resources are monitored, we proceed as follows:

* If we do not have access to the VM: ask the user to set up the prometheus node exporter and whitelist port 9100 so that it is accessible from 2a0a:e5c0:2:2:0:c8ff:fe68:bf3b
* Otherwise, do the above step ourselves
* Ensure the customer has an LDAP account
** Ask the user to log in with their LDAP user at https://monitoring-v3.ungleich.ch/ - this way grafana knows about the user (similar to redmine)
* Create a folder in grafana on https://monitoring-v3.ungleich.ch/ with the same name as the LDAP user (for instance "nicocustomer")
 * Modify the permissions of the folder 
 ** Remove the standard Viewer Role 
 ** Add User -> the LDAP user -> View 

Set up a dashboard. If it allows selecting nodes:

 * Limit the variable by defining the regex in the dashboard settings 

If the user requested alerts:

 * Configure them in cdist, type __dcl_monitoring_server2020/files/prometheus-v3/ 

 Finally: 

 <pre> 
 cdist config -v monitoring-v3.ungleich.ch 
 </pre> 

 h2. Old Monitoring 

Before 2020-07, our monitoring incorporated more services and had a different approach:


 We used the following technology / products for the monitoring: 

 * consul (service discovery) 
 * prometheus (exporting, gathering, alerting) 
 * Grafana (presenting) 

Prometheus and grafana are located on the monitoring control machines:

 * monitoring.place5.ungleich.ch 
 * monitoring.place6.ungleich.ch 

The monitoring machines above are now being replaced by emonitor1.place5.ungleich.ch and emonitor1.place6.ungleich.ch. The difference is that the new machines are independent of ceph and have a dedicated uplink.


 h3. Consul 

We used a consul cluster for each datacenter (e.g. place5 and place6).
The consul servers are still located on the physical machines (red{1..3} resp. black{1..3}) and the agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html

Consul has some drawbacks (a node leaving the cluster is, by default, no longer monitored) and the advantage of fully dynamic discovery is not significant for physical machines of which we already have an inventory.

 h3. Authentication 

Grafana authentication worked via LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group were mapped to the Admin role, all other users were Viewers.

This was retired; the monitoring servers now use static usernames in order to be independent of the LDAP infrastructure.