The ungleich monitoring infrastructure » History » Version 9
Nico Schottelius, 07/05/2020 11:51 AM
h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Introduction

We use the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and Grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

h2. Consul

We use a consul cluster for each datacenter (e.g. place5 and place6).
The servers are located on the physical machines (red{1..3} and black{1..3} respectively) and the agents run on all other monitored machines (such as servers and VMs).

consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip) [https://www.consul.io/docs/guides/datacenters.html]

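As a sketch, a consul service definition publishing a node exporter might look like the following (the service name, port and tag are illustrative, not taken from our live configuration):

<pre>
{
  "service": {
    "name": "node-exporter",
    "port": 9100,
    "tags": ["prometheus"]
  }
}
</pre>
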
h2. Prometheus

Prometheus is responsible for collecting all metrics from the exporters on the monitored hosts and storing them, and for sending out alerts if needed (via the alertmanager).

h3. Exporters

* node (host-specific metrics, e.g. CPU, RAM and disk usage)
* ceph (Ceph-specific metrics, e.g. pool usage, OSDs)
* blackbox (metrics about the online state of http/https services)

The node exporter is located on all monitored hosts.
The ceph exporter is provided by Ceph itself and is located on the Ceph manager.
The blackbox exporter is located on the monitoring control machine itself.
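To illustrate how the pieces fit together, a prometheus scrape job could discover these exporters via consul roughly as follows (job name, consul address and service name are assumptions, not our live configuration):

<pre>
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['node-exporter']
    relabel_configs:
      # copy the consul datacenter into a "dc" label
      - source_labels: [__meta_consul_dc]
        target_label: dc
</pre>
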

h3. Alerts

We configured the following alerts:

* ceph OSDs down
* ceph health state is not OK
* ceph quorum not OK
* ceph pool disk usage too high
* ceph disk usage too high
* instance down
* disk usage too high
* monitored website down
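For example, the "instance down" alert can be expressed as a rule like the following sketch (threshold, duration and labels are assumptions, not the deployed rules):

<pre>
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
</pre>
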

h2. Grafana

Grafana provides dashboards for the following:

* node (metrics about CPU, RAM, disk usage and so on)
* blackbox (metrics from the blackbox exporter)
* ceph (important metrics from the ceph exporter)

h3. Authentication

Grafana authenticates against LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group are mapped to the Admin role; all other users are Viewers.
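In grafana's @ldap.toml@, this role mapping can be expressed roughly as follows (the group DNs are illustrative, not our live configuration):

<pre>
[[servers.group_mappings]]
group_dn = "cn=devops,ou=groups,dc=ungleich,dc=ch"
org_role = "Admin"

[[servers.group_mappings]]
group_dn = "*"
org_role = "Viewer"
</pre>
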

h2. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist.

h2. Misc

* You're probably looking for the @__dcl_monitoring_server@ type, which centralizes a bunch of stuff.
* This page needs some love!

h2. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

h2. Monitoring Guide

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration:

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
SUCCESS: 1 rules found
</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store