
h1. The ungleich monitoring infrastructure 

 {{>toc}} 

 h2. Introduction 

We use the following technologies / products for monitoring:

* Consul (service discovery)
* Prometheus (metric exporting, gathering and alerting)
* Grafana (presentation)

Prometheus and Grafana are located on the monitoring control machines:

 * monitoring.place5.ungleich.ch 
 * monitoring.place6.ungleich.ch 

 h2. Consul 

We use a Consul cluster for each datacenter (e.g. place5 and place6).
The Consul servers run on the physical machines (red{1..3} and black{1..3}, respectively); the agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).
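A minimal sketch of such a service definition, assuming the standard Consul JSON format (e.g. a file in @/etc/consul.d/@); the service name and port are illustrative:

<pre>
{
  "service": {
    "name": "node-exporter",
    "port": 9100,
    "tags": ["prometheus"]
  }
}
</pre>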

The datacenters are connected via inter-datacenter communication (WAN gossip), see https://www.consul.io/docs/guides/datacenters.html

 h2. Prometheus 

Prometheus is responsible for collecting all metrics from the monitored hosts (via the exporters), storing them, and sending out alerts if needed (via the Alertmanager).

 h3. Exporters 

* Node (host-specific metrics, e.g. CPU, RAM and disk usage)
* Ceph (Ceph-specific metrics, e.g. pool usage, OSD status)
* Blackbox (metrics about the online state of HTTP/HTTPS services)

The node exporter runs on all monitored hosts.
The Ceph exporter is provided by Ceph itself and runs on the Ceph manager.
The blackbox exporter runs on the monitoring control machines themselves.
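Prometheus discovers the exporters via Consul. A minimal sketch of a scrape job using Consul service discovery in @/etc/prometheus/prometheus.yml@ (job name, Consul address and service name are illustrative, not the deployed configuration):

<pre>
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'localhost:8500'      # local Consul agent (assumed)
        services: ['node-exporter']   # Consul service published by the hosts (assumed)
</pre>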

 h3. Alerts 

We have configured the following alerts (an illustrative rule sketch follows the list):

* Ceph OSDs down
* Ceph health state is not OK
* Ceph quorum not OK
* Ceph pool disk usage too high
* Ceph disk usage too high
* Instance down
* Disk usage too high
* Monitored website down
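The alert rules live in the @*.rules@ files under @/etc/prometheus/@ (see the @promtool@ output further down). A minimal sketch of what an instance-down rule could look like; the exact expression, duration and labels are assumptions, not the deployed rule:

<pre>
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0            # target could not be scraped
        for: 5m                  # only fire after 5 minutes (assumed)
        labels:
          severity: critical     # assumed label
        annotations:
          summary: "Instance {{ $labels.instance }} down"
</pre>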

 h2. Grafana 

 Grafana provides dashboards for the following: 

* Node (CPU, RAM, disk and other host metrics)
* Blackbox (metrics from the blackbox exporter)
* Ceph (important metrics from the Ceph exporter)

 h3. Authentication 

Grafana authenticates against LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group are mapped to the Admin role; all other users are Viewers.
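In Grafana's @ldap.toml@ this mapping could look roughly as follows; the group DN is an assumption, only the @devops@ → Admin / everyone-else → Viewer mapping is taken from the setup above:

<pre>
# group mappings are evaluated top to bottom; first match wins
[[servers.group_mappings]]
group_dn = "cn=devops,ou=groups,dc=ungleich,dc=ch"   # assumed DN
org_role = "Admin"

[[servers.group_mappings]]
group_dn = "*"                                       # everyone else
org_role = "Viewer"
</pre>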

 h2. Monit 

 We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See `__ungleich_monit` type in dot-cdist. 

 h2. Misc 

* You're probably looking for the @__dcl_monitoring_server@ type, which centralizes a bunch of stuff.
 * This page needs some love! 

 h2. Service/Customer monitoring 

* A few blackbox exporter checks can be found on the datacenter monitoring infrastructure.
* There's a new Prometheus + Grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing it also monitors the VPN server and staticwebhosting. No Alertmanager yet; partially manual.


 h2. Monitoring Guide 

 h3. Configuring prometheus 

Use @promtool check config@ to verify the configuration:

 <pre> 
 [21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml  
 Checking /etc/prometheus/prometheus.yml 
   SUCCESS: 4 rule files found 

 Checking /etc/prometheus/blackbox.rules 
   SUCCESS: 3 rules found 

 Checking /etc/prometheus/ceph-alerts.rules 
   SUCCESS: 8 rules found 

 Checking /etc/prometheus/node-alerts.rules 
   SUCCESS: 8 rules found 

 Checking /etc/prometheus/uplink-monitoring.rules 
   SUCCESS: 1 rules found 

 </pre> 
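The rule files checked above are presumably referenced from @prometheus.yml@ roughly like this (a sketch, not the verbatim configuration):

<pre>
rule_files:
  - /etc/prometheus/blackbox.rules
  - /etc/prometheus/ceph-alerts.rules
  - /etc/prometheus/node-alerts.rules
  - /etc/prometheus/uplink-monitoring.rules
</pre>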

 h3. Querying prometheus 

 Use @promtool query instant@ to query values: 

 <pre> 
 [21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1' 
 probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577] 
 probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577] 
 probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577] 
 probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577] 
 </pre> 


 h3. Using Grafana 

 * Username for changing items: "admin" 
 * Username for viewing dashboards: "ungleich" 
 * Passwords in the password store 

 h3. Managing alerts 

 * Read https://prometheus.io/docs/practices/alerting/ as an introduction 
 * Use @amtool@ 

 Showing current alerts: 

 <pre> 
 [14:54:35] monitoring.place6:~# amtool alert query 
 Alertname              Starts At                   Summary                                                                
 InstanceDown           2020-07-01 10:24:03 CEST    Instance red1.place5.ungleich.ch down                                  
 InstanceDown           2020-07-01 10:24:03 CEST    Instance red3.place5.ungleich.ch down                                  
 InstanceDown           2020-07-05 12:51:03 CEST    Instance apu-router2.place5.ungleich.ch down                           
 UngleichServiceDown    2020-07-05 13:51:19 CEST    Ungleich internal service https://staging.swiss-crowdfunder.com down   
 InstanceDown           2020-07-05 13:55:33 CEST    Instance https://swiss-crowdfunder.com down                            
 CephHealthSate         2020-07-05 13:59:49 CEST    Ceph Cluster is not healthy.                                           
 LinthalHigh            2020-07-05 14:01:41 CEST    Temperature on risinghf-19 is 32.10012512207032                        
 [14:54:41] monitoring.place6:~#  
 </pre> 

 Silencing alerts: 

 <pre> 
 [14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate 
 4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa 
 [15:00:06] monitoring.place6:~# amtool silence query 
 ID                                      Matchers                    Ends At                    Created By    Comment                 
 4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa    alertname=CephHealthSate    2020-07-05 14:00:06 UTC    root          Ceph is actually fine   
 [15:00:13] monitoring.place6:~#  
 </pre> 

TOBEFIXED: silences should be created with an explicit author (and related options) instead of the default @root@.
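Assuming the installed @amtool@ supports the author and duration options, adding a silence with an explicit author could look roughly like this (hypothetical values):

<pre>
# hypothetical example: set author, comment and an explicit duration
amtool silence add --author="nico" --duration="2h" --comment="Ceph is actually fine" alertname=CephHealthSate
</pre>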