The ungleich monitoring infrastructure » History » Version 12
Nico Schottelius, 07/05/2020 05:43 PM
h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Introduction

We use the following technology / products for the monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and Grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch
h2. Consul

We use a consul cluster for each datacenter (e.g. place5 and place6).
The servers are located on the physical machines (red{1..3} resp. black{1..3}) and the agents run on all other monitored machines (such as servers and VMs).

consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html
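As an illustration (not copied from our actual configuration), a consul service definition publishing a node exporter could look like this:

<pre>
{
  "service": {
    "name": "node-exporter",
    "port": 9100,
    "tags": ["exporter"]
  }
}
</pre>

Such a file would typically live in the agent's configuration directory (e.g. /etc/consul.d/) on the monitored host.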
h2. Prometheus

Prometheus is responsible for collecting all data from the monitored hosts (via the exporters) and storing it, as well as for sending out alerts if needed (via the alertmanager).

h3. Exporters

* Node (host specific metrics, e.g. CPU, RAM and disk usage)
* Ceph (Ceph specific metrics, e.g. pool usage, OSDs, ...)
* blackbox (metrics about the online state of http/https services)

The node exporter is located on all monitored hosts.
The Ceph exporter is provided by Ceph itself and is located on the Ceph manager.
The blackbox exporter is located on the monitoring control machine itself.
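On the prometheus side, services published by consul can be scraped via consul service discovery. A sketch, with illustrative server and service names:

<pre>
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['node-exporter']
</pre>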
h3. Alerts

We configured the following alerts:

* ceph osds down
* ceph health state is not OK
* ceph quorum not OK
* ceph pool disk usage too high
* ceph disk usage too high
* instance down
* disk usage too high
* monitored website down
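For illustration, an "instance down" alert in prometheus rule file syntax could look roughly like this (the threshold and annotation are assumptions, not our exact rule):

<pre>
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        annotations:
          summary: "Instance {{ $labels.instance }} down"
</pre>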
h2. Grafana

Grafana provides dashboards for the following:

* Node (metrics about CPU, RAM, disk usage and so on)
* blackbox (metrics from the blackbox exporter)
* ceph (important metrics from the ceph exporter)

h3. Authentication

The Grafana authentication works over LDAP (see [[The ungleich LDAP guide]]).
All users in the @devops@ group are mapped to the Admin role; all other users are Viewers.
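The group-to-role mapping is configured in grafana's @ldap.toml@; a sketch with illustrative DNs:

<pre>
[[servers.group_mappings]]
group_dn = "cn=devops,ou=groups,dc=ungleich,dc=ch"
org_role = "Admin"

[[servers.group_mappings]]
group_dn = "*"
org_role = "Viewer"
</pre>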
h2. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist.
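As a sketch, a monit process check that can restart a daemon looks roughly like this (the service name and paths are illustrative):

<pre>
check process prometheus with pidfile /var/run/prometheus.pid
  start program = "/etc/init.d/prometheus start"
  stop program = "/etc/init.d/prometheus stop"
</pre>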
h2. Misc

* You're probably looking for the @__dcl_monitoring_server@ type, which centralizes a bunch of stuff.
* This page needs some love!
h2. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.
h2. Monitoring Guide

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
SUCCESS: 1 rules found
</pre>
h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>
Typical queries:

Creating a sum of all metrics that contain a common label, for instance summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum by (job) (probe_success)'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#
</pre>
Combining different metrics for filtering, for instance to filter all metrics of type @probe_success@ which also have a metric @probe_ip_protocol@ with value 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to match the two metrics on a label:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert:

* if the sum over all jobs matching a certain regex, matched on the ip protocol, is 0
** this particular job indicates total loss of connectivity
* we want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6
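Putting these pieces together, the alerting expression for total IPv4 connectivity loss of the uplinks could be sketched as follows (the alert name is an assumption):

<pre>
- alert: UplinkIpv4Down
  expr: sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) == 0
</pre>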
Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>
Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#
</pre>
Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#
</pre>
Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>
The values 8 and 12 mean:

* 8 = 4 (ip version 4) * probe_success (2 routers are up)
* 12 = 6 (ip version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 are up, it does not matter, as 0*4 = 0 and 0*6 = 0.
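The arithmetic above can be sanity-checked in a few lines of Python (illustrative only; @summed_probe@ is a made-up helper mimicking the query, not part of prometheus):

```python
def summed_probe(protocol: int, routers_up: int) -> int:
    """Mimics sum(probe_success * on(instance) probe_ip_protocol == P) by (job):
    each router that is up contributes its protocol number to the sum."""
    return protocol * routers_up

assert summed_probe(4, 2) == 8   # IPv4, both routers up
assert summed_probe(6, 2) == 12  # IPv6, both routers up
assert summed_probe(4, 2) / 4 == 2  # normalised: number of routers up
# Total-loss detection needs no normalisation: 0 routers up stays 0.
assert summed_probe(4, 0) == 0 and summed_probe(6, 0) == 0
```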
h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store
h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~#
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~#
</pre>

Better to also set the author and related fields. TOBEFIXED
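For reference, amtool supports setting the author, duration and comment explicitly when creating a silence; a sketch (the values are illustrative):

<pre>
amtool silence add --author="nico" --duration="2h" \
  --comment="Ceph is actually fine" alertname=CephHealthSate
</pre>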