h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Introduction

We use the following technologies / products for monitoring:

* Consul (service discovery)
* Prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and Grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

h2. Consul

We run a Consul cluster in each datacenter (e.g. place5 and place6).
The Consul servers are located on the physical machines (red{1..3} and black{1..3}, respectively); the agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).

The datacenters communicate with each other via WAN gossip ("documentation":https://www.consul.io/docs/guides/datacenters.html).
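
For illustration, a Consul service definition publishing a node exporter could look roughly like this (a sketch; the actual service names and ports in our setup may differ):

<pre>
{
  "service": {
    "name": "node-exporter",
    "port": 9100,
    "tags": ["exporter"]
  }
}
</pre>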

h2. Prometheus

Prometheus is responsible for pulling all metrics out of the monitored hosts (via the exporters) and storing them, as well as for sending out alerts when needed (via the Alertmanager).

h3. Exporters

* Node (host-specific metrics, e.g. CPU, RAM and disk usage)
* Ceph (Ceph-specific metrics, e.g. pool usage, OSDs, ...)
* Blackbox (metrics about the online state of HTTP/HTTPS services)

The node exporter is located on all monitored hosts.
The Ceph exporter is provided by Ceph itself and is located on the Ceph manager.
The blackbox exporter is located on the monitoring control machine itself.
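
Since every host publishes its exporters in Consul, Prometheus can discover its scrape targets via Consul service discovery. A minimal sketch of such a scrape job (the service name and Consul address are illustrative assumptions, not our literal configuration):

<pre>
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['node-exporter']
</pre>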

h3. Alerts

We configured the following alerts:

* Ceph OSDs down
* Ceph health state is not OK
* Ceph quorum not OK
* Ceph pool disk usage too high
* Ceph disk usage too high
* Instance down
* Disk usage too high
* Monitored website down
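
As an illustration, the "instance down" alert can be expressed in the Prometheus 2.x YAML rule format roughly like this (a sketch; the exact thresholds and annotations in our rule files may differ):

<pre>
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        annotations:
          summary: "Instance {{ $labels.instance }} down"
</pre>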

h2. Grafana

Grafana provides dashboards for the following:

* Node (metrics about CPU, RAM, disk usage and so on)
* Blackbox (metrics from the blackbox exporter)
* Ceph (important metrics from the Ceph exporter)

h3. Authentication

Grafana authentication works over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group are mapped to the Admin role; all other users are Viewers.
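
In Grafana's LDAP configuration (ldap.toml) this corresponds to group mappings along these lines (a sketch; the actual group DNs depend on our LDAP tree):

<pre>
[[servers.group_mappings]]
group_dn = "cn=devops,ou=groups,dc=ungleich,dc=ch"
org_role = "Admin"

[[servers.group_mappings]]
group_dn = "*"
org_role = "Viewer"
</pre>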

h2. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist.

h2. Misc

* You're probably looking for the @__dcl_monitoring_server@ type, which centralizes a bunch of stuff.
* This page needs some love!

h2. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There's a new Prometheus+Grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No Alertmanager yet. Partially manual.

h2. Monitoring Guide

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration:

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

</pre>

h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that share a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum by (job) (probe_success)
'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus# 
</pre>

Combining different metrics for filtering, for instance to select all metrics of type "probe_success" whose instance also carries the metric probe_ip_protocol with value 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The @on@ operator is used to match the two metrics on a common label:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum over all jobs matching a certain regex, combined with a match on the IP protocol, is 0
** this particular case indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus# 
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>

The values 8 and 12 mean:

* 8 = 4 (IP version 4) * 2 (probe_success of the 2 routers that are up)
* 12 = 6 (IP version 6) * 2 (probe_success of the 2 routers that are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, the normalisation does not matter, as 0*4 = 0 and 0*6 = 0.
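
Putting the pieces together, the connectivity alert sketched above could be written along these lines (a sketch; the rule and alert names are illustrative assumptions):

<pre>
groups:
  - name: uplink-monitoring
    rules:
      - alert: TotalConnectivityLoss
        expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
        for: 5m
        annotations:
          summary: "No IPv4 connectivity left in job {{ $labels.job }}"
</pre>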

h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@ to inspect and silence alerts

Showing current alerts:

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary                                                               
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down                                 
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down                                 
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down                          
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down  
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down                           
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.                                          
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032                       
[14:54:41] monitoring.place6:~# 
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment                
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine  
[15:00:13] monitoring.place6:~# 
</pre>
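
A silence can be removed again before it ends with @amtool silence expire <id>@.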

It would be better to also set the author and related options when creating silences. TOBEFIXED