Dominique Roux, 04/20/2019 09:12 PM
| 1 | 1 | Dominique Roux | h1. The ungleich monitoring infrastructure |
|---|---|---|---|
| 2 | |||
| 3 | {{>toc}} |
||
| 4 | |||
| 5 | h2. Introduction |
||
| 6 | |||
| 7 | 2 | Dominique Roux | We use the following technologies and products for monitoring: |
| 8 | |||
| 9 | * Consul (service discovery) |
| 10 | * Prometheus (exporting, gathering, alerting) |
| 11 | * Grafana (presenting) |
||
| 12 | |||
| 13 | 1 | Dominique Roux | h2. Consul |
| 14 | |||
| 15 | 2 | Dominique Roux | We run one Consul cluster per datacenter (e.g. place5 and place6). |
| 16 | The Consul servers run on the physical machines (red{1..3} and black{1..3}, respectively), and the agents run on all other monitored machines (such as servers and VMs). |
||
| 17 | |||
| 18 | Each Consul agent is configured to publish the services its host provides (e.g. the exporters). |
||
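To make the service publishing concrete, here is a minimal sketch that builds a Consul service definition for an exporter. The service name, the tag and the check interval are assumptions for illustration, not values taken from the actual setup; 9100 is the node exporter's default port.

```python
import json


def exporter_service(name: str, port: int, interval: str = "10s") -> dict:
    """Build a Consul service definition for a Prometheus exporter.

    Name, tag and interval are illustrative assumptions.
    """
    return {
        "service": {
            "name": name,
            "port": port,
            "tags": ["prometheus"],
            # HTTP health check against the exporter's metrics endpoint.
            "check": {
                "http": f"http://localhost:{port}/metrics",
                "interval": interval,
            },
        }
    }


if __name__ == "__main__":
    # Print the definition as JSON, the format a Consul agent reads
    # from its configuration directory.
    print(json.dumps(exporter_service("node-exporter", 9100), indent=2))
```

An agent that loads such a definition registers the service in the catalog, where other tools can look it up.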
| 19 | |||
| 20 | The datacenters communicate with each other via WAN gossip [https://www.consul.io/docs/guides/datacenters.html]. |
||
| 21 | |||
| 22 | 1 | Dominique Roux | h2. Prometheus |
| 23 | 2 | Dominique Roux | |
| 24 | Prometheus collects the metrics from the exporters on the monitored hosts and stores them. It also sends out alerts when needed (via Alertmanager). |
||
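Prometheus can discover scrape targets from Consul natively (@consul_sd_configs@); this page does not say which mechanism is used here. The following sketch only illustrates the mapping from Consul catalog entries to Prometheus target groups in the file-based discovery format. The function name, the label names and the sample entry are hypothetical.

```python
import json


def to_file_sd(catalog_entries: list[dict]) -> list[dict]:
    """Convert Consul catalog service entries into Prometheus file_sd groups."""
    groups = []
    for entry in catalog_entries:
        # ServiceAddress may be empty; fall back to the node address.
        host = entry.get("ServiceAddress") or entry["Address"]
        groups.append({
            "targets": [f"{host}:{entry['ServicePort']}"],
            "labels": {
                "node": entry["Node"],
                "datacenter": entry["Datacenter"],
                "service": entry["ServiceName"],
            },
        })
    return groups


if __name__ == "__main__":
    # Hypothetical entry shaped like a /v1/catalog/service/<name> response item.
    sample = [{
        "Node": "vm1",
        "Address": "192.0.2.10",
        "Datacenter": "place5",
        "ServiceName": "node-exporter",
        "ServiceAddress": "",
        "ServicePort": 9100,
    }]
    print(json.dumps(to_file_sd(sample), indent=2))
```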
| 25 | |||
| 26 | h3. Exporters |
||
| 27 | |||
| 28 | * Node (host-specific metrics, e.g. CPU, RAM and disk usage) |
| 29 | * Ceph (Ceph-specific metrics, e.g. pool usage, OSDs) |
| 30 | * blackbox (online state of HTTP/HTTPS services) |
||
| 31 | |||
| 32 | The node exporter runs on all monitored hosts. |
| 33 | The Ceph exporter is provided by Ceph itself and runs on the Ceph manager. |
| 34 | The blackbox exporter runs on the monitoring control machine itself. |
||
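Because the blackbox exporter runs centrally, the website to check is passed to it as a request parameter rather than being scraped directly. A minimal sketch of how such a probe URL is built, assuming the exporter's default port 9115 and the @http_2xx@ module from its default configuration (the target URL is a placeholder):

```python
from urllib.parse import urlencode


def probe_url(target: str, module: str = "http_2xx",
              exporter: str = "http://localhost:9115") -> str:
    """Build the blackbox exporter /probe URL for one monitored target."""
    # urlencode escapes the target so it survives as a query parameter.
    query = urlencode({"target": target, "module": module})
    return f"{exporter}/probe?{query}"


if __name__ == "__main__":
    print(probe_url("https://example.com"))
```

The response contains the @probe_success@ metric, which is 1 when the probe succeeded and 0 otherwise.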
| 35 | |||
| 36 | h3. Alerts |
||
| 37 | |||
| 38 | We configured the following alerts: |
||
| 39 | |||
| 40 | * Ceph OSDs down |
| 41 | * Ceph health state is not OK |
| 42 | * Ceph quorum not OK |
| 43 | * Ceph pool disk usage too high |
| 44 | * Ceph disk usage too high |
| 45 | * instance down |
| 46 | * disk usage too high |
| 47 | * monitored website down |
||
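As a rough illustration of how some of the alerts listed above could be expressed, the sketch below assembles a few rules in the layout of a Prometheus rule group. The alert names, thresholds, durations and severity label are assumptions, not the rules actually deployed; the metrics used are @up@, @probe_success@ (blackbox exporter), @ceph_health_status@ (Ceph manager) and the node exporter's filesystem metrics.

```python
import json

# Illustrative rules only: thresholds and durations are assumptions.
RULES = [
    {"alert": "InstanceDown", "expr": "up == 0", "for": "5m"},
    {"alert": "WebsiteDown", "expr": "probe_success == 0", "for": "5m"},
    {"alert": "CephHealthNotOK", "expr": "ceph_health_status != 0", "for": "5m"},
    {
        "alert": "DiskUsageTooHigh",
        "expr": "(1 - node_filesystem_avail_bytes"
                " / node_filesystem_size_bytes) * 100 > 90",
        "for": "10m",
    },
]


def rule_group(name: str, rules: list[dict], severity: str = "critical") -> dict:
    """Wrap rules in a rule-group structure and attach a severity label."""
    return {
        "groups": [{
            "name": name,
            "rules": [{**rule, "labels": {"severity": severity}} for rule in rules],
        }]
    }


if __name__ == "__main__":
    # Printed as JSON for inspection; a real rules file would be written as YAML.
    print(json.dumps(rule_group("monitoring", RULES), indent=2))
```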
| 48 | 1 | Dominique Roux | |
| 49 | h2. Grafana |