Project

General

Profile

The ungleich monitoring infrastructure » History » Version 6

Timothée Floure, 03/12/2020 03:40 PM
Quickly mention service-monitoring.ungleich.ch

1 1 Dominique Roux
h1. The ungleich monitoring infrastructure
2
3
{{>toc}}
4
5
h2. Introduction
6
7 2 Dominique Roux
We use the following technology / products for the monitoring:
8
9
* consul (service discovery)
10
* prometheus (exporting, gathering, alerting)
11
* Grafana (presenting)
12
13 3 Dominique Roux
Prometheus and grafana are located on the monitoring control machines
14
15
* monitoring.place5.ungleich.ch
16
* monitoring.place6.ungleich.ch
17
18 1 Dominique Roux
h2. Consul
19
20 2 Dominique Roux
We use a consul cluster for each datacenter (e.g. place5 and place6). 
21
The servers are located on the physical machines (red{1..3} resp. black{1..3}) and the agents are running on all other monitored machines (such as servers and VMs)
22
23
consul is configured to publish the service its host is providing (e.g. the exporters)
24
25
There is a inter-datacenter communication (wan gossip) [https://www.consul.io/docs/guides/datacenters.html]
26
27 1 Dominique Roux
h2. Prometheus
28 2 Dominique Roux
29
Prometheus is responsible to get all data out (exporters) of the monitored host and store them. Also to send out alerts if needed (alertmanager)
30
31
h3. Exporters
32
33
* Node (host specific metrics (e.g. CPU-, RAM-, Disk-usage..))
34
* Ceph (Ceph specific metrics (e.g. pool usage, osds ..))
35
* blackbox (Metrics about online state of http/https services)
36
37
The node exporter is located on all monitored hosts
38
Ceph exporter is porvided by ceph itself and is located on the ceph manager.
39
The blackbox exporter is located on the monitoring control machine itself.
40
41
h3. Alerts
42
43
We configured the following alerts:
44
45
* ceph osds down
46
* ceph health state is not OK
47
* ceph quorum not OK
48
* ceph pool disk usage too high
49
* ceph disk usage too high
50
* instance down
51
* disk usage too high
52
* Monitored website down
53 1 Dominique Roux
54
h2. Grafana
55 3 Dominique Roux
56
Grafana provides dashboards for the following:
57
58
* Node (metrics about CPU-, RAM-, Disk and so on usage)
59
* blackbox (metrics about the blackbox exporter)
60
* ceph (important metrics from the ceph exporter)
61
62
h3. Authentication
63
64 4 Dominique Roux
The grafana authentication works over ldap. (See [[The ungleich LDAP guide]])
65 3 Dominique Roux
All users in the @devops@ group will be mapped to the Admin role, all other users will be Viewers
66 5 Timothée Floure
67
h2. Monit
68
69
We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See `__ungleich_monit` type in dot-cdist.
70
71
h2. Misc
72
73
* You're probably looking for the `__dcl_monitoring_server` type, which centralize a bunch of stuff.
74
* This page needs some love!
75 6 Timothée Floure
76
h1. Service/Customer monitoring
77
78
* A few blackbox things can be found on the datacenter monitoring infrastructure.
79
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.