The ungleich monitoring infrastructure » History » Version 2

Dominique Roux, 04/20/2019 09:12 PM

h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Introduction

We use the following technologies and products for monitoring:

* Consul (service discovery)
* Prometheus (exporting, gathering, alerting)
* Grafana (presenting)

h2. Consul

We run one Consul cluster per datacenter (e.g. place5 and place6).
The Consul servers run on the physical machines (red{1..3} and black{1..3}, respectively), and the Consul agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services that its host provides (e.g. the exporters).

The datacenters communicate with each other via WAN gossip: https://www.consul.io/docs/guides/datacenters.html
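
To illustrate how services published in Consul can be consumed, the following is a minimal sketch of a Prometheus scrape job that uses Consul service discovery. It is not our actual configuration: the service name @node-exporter@ and the Consul address @localhost:8500@ (Consul's default HTTP port) are assumptions, and the datacenter name is taken from the example above.

```yaml
# Sketch only: discover scrape targets from the local Consul agent.
scrape_configs:
  - job_name: node
    consul_sd_configs:
      - server: 'localhost:8500'      # assumed address of a Consul agent
        datacenter: place6            # one cluster per datacenter
        services: ['node-exporter']   # hypothetical service name
    relabel_configs:
      # Use the Consul node name as the instance label.
      - source_labels: [__meta_consul_node]
        target_label: instance
      # Keep the datacenter as a label so dashboards can filter on it.
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
```

With a setup like this, a newly registered exporter is picked up without editing the Prometheus configuration by hand.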

h2. Prometheus

Prometheus is responsible for collecting the metrics from the exporters on the monitored hosts and storing them. It also sends out alerts when needed (via Alertmanager).
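
As a rough sketch of how the alerting side is typically wired up, Prometheus needs to know where Alertmanager runs and which rule files to load. The address @localhost:9093@ (Alertmanager's default port) and the rule file path are assumptions for illustration, not values taken from our setup.

```yaml
# Sketch only: connect Prometheus to Alertmanager and load alerting rules.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']   # assumed Alertmanager address

rule_files:
  - 'rules/*.yml'                       # hypothetical path to rule files
```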

h3. Exporters

* Node (host-specific metrics, e.g. CPU, RAM, and disk usage)
* Ceph (Ceph-specific metrics, e.g. pool usage, OSDs)
* Blackbox (metrics about the online state of HTTP/HTTPS services)

The node exporter runs on all monitored hosts.
The Ceph exporter is provided by Ceph itself and runs on the Ceph manager.
The blackbox exporter runs on the monitoring control machine itself.
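
Because the blackbox exporter probes remote services from the monitoring machine, its scrape job differs from the others: the website is passed as a parameter, and the scrape itself goes to the exporter. The sketch below follows the relabeling pattern from the blackbox exporter documentation. The target URL is a placeholder, and @127.0.0.1:9115@ assumes the exporter listens on its default port on the same machine as Prometheus.

```yaml
# Sketch only: probe a website through a local blackbox exporter.
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]              # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://example.com       # placeholder for a monitored website
    relabel_configs:
      # Pass the website URL to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Show the website URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the blackbox exporter.
      - target_label: __address__
        replacement: 127.0.0.1:9115   # assumed exporter address
```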

h3. Alerts

We configured the following alerts:

* Ceph OSDs down
* Ceph health state is not OK
* Ceph quorum not OK
* Ceph pool disk usage too high
* Ceph disk usage too high
* instance down
* disk usage too high
* monitored website down
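
A few of the alerts above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch, not our actual rule file: the alert names, durations, and the 90% threshold are hypothetical, and the metric names assume a recent node exporter, the Ceph manager's Prometheus module, and the blackbox exporter.

```yaml
# Sketch only: example rules for some of the alerts listed above.
groups:
  - name: example-alerts              # hypothetical group name
    rules:
      - alert: InstanceDown
        expr: up == 0                 # the scrape of a target failed
        for: 5m
      - alert: MonitoredWebsiteDown
        expr: probe_success == 0      # the blackbox probe failed
        for: 5m
      - alert: CephHealthNotOk
        expr: ceph_health_status != 0 # 0 means HEALTH_OK
        for: 5m
      - alert: CephOsdDown
        expr: ceph_osd_up == 0
        for: 5m
      - alert: DiskUsageTooHigh
        # Fraction of used space per filesystem; threshold is an assumption.
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.9
        for: 10m
```

The @for@ clause delays firing until the condition has held for the given time, which avoids alerts on short interruptions.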

h2. Grafana