h1. The ungleich monitoring infrastructure

{{>toc}}

h2. Introduction

We use the following technologies / products for monitoring:

* consul (service discovery)
* prometheus (exporting, gathering, alerting)
* Grafana (presenting)

Prometheus and Grafana are located on the monitoring control machines:

* monitoring.place5.ungleich.ch
* monitoring.place6.ungleich.ch

h2. Consul

We use a consul cluster for each datacenter (e.g. place5 and place6). The servers are located on the physical machines (red{1..3} resp. black{1..3}) and the agents are running on all other monitored machines (such as servers and VMs).

consul is configured to publish the services its host is providing (e.g. the exporters).

There is an inter-datacenter communication (WAN gossip): https://www.consul.io/docs/guides/datacenters.html

h2. Prometheus

Prometheus is responsible for pulling all data (via the exporters) from the monitored hosts and storing it, and for sending out alerts if needed (via the alertmanager).

h3. Exporters

* Node (host-specific metrics, e.g. CPU, RAM and disk usage)
* Ceph (Ceph-specific metrics, e.g. pool usage, OSDs, ...)
* blackbox (metrics about the online state of http/https services)

The node exporter is located on all monitored hosts. The Ceph exporter is provided by Ceph itself and is located on the Ceph manager. The blackbox exporter is located on the monitoring control machine itself.

h3. Alerts

We configured the following alerts:

* ceph osds down
* ceph health state is not OK
* ceph quorum not OK
* ceph pool disk usage too high
* ceph disk usage too high
* instance down
* disk usage too high
* monitored website down

h2. Grafana

Grafana provides dashboards for the following:

* Node (metrics about CPU, RAM, disk usage and so on)
* blackbox (metrics from the blackbox exporter)
* ceph (important metrics from the ceph exporter)

h3. Authentication

The Grafana authentication works over LDAP (see [[The ungleich LDAP guide]]). All users in the @devops@ group are mapped to the Admin role; all other users are Viewers.

h2. Monit

We use "monit":https://mmonit.com/ for monitoring and restarting daemons. See the @__ungleich_monit@ type in dot-cdist.

h2. Misc

* You're probably looking for the @__dcl_monitoring_server@ type, which centralizes a bunch of stuff.
* This page needs some love!

h2. Service/Customer monitoring

* A few blackbox things can be found on the datacenter monitoring infrastructure.
* There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @fnux for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

h2. Monitoring Guide

h3. Configuring prometheus

Use @promtool check config@ to verify the configuration.

<pre>
[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found
</pre>
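As an illustration of what such a rule file contains, a minimal entry in @node-alerts.rules@ for the instance-down alert could look like the sketch below. This is an assumption for illustration only: the alert name matches the @InstanceDown@ alerts visible in the @amtool@ output further down, but the exact expression, duration and labels of our real rule may differ.

<pre>
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        # "up" is 0 when prometheus cannot scrape a target
        expr: up == 0
        # only fire after the target has been down for a while
        for: 5m
        labels:
          # assumption: severity labels as described under "Severity levels" below
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
</pre>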
h3. Querying prometheus

Use @promtool query instant@ to query values:

<pre>
[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
</pre>

Typical queries:

Creating a sum of all metrics that contain a common label. For instance, summing over all jobs:

<pre>
sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum by (job) (probe_success) '
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#
</pre>

Combining different metrics for filtering. For instance, to filter all metrics of type @probe_success@ which also have a metric @probe_ip_protocol@ with value = 4:

* probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The operator @on@ is used to match the two metrics on a shared label:

<pre>
sum(probe_success * on(instance) probe_ip_protocol == 4)
</pre>

Creating an alert (see the rule sketch at the end of this section):

* if the sum of all jobs of a certain regex and match on ip protocol is 0
** this particular job indicates total loss of connectivity
* We want to get a vector like this:
** job="routers-place5", protocol = 4
** job="uplink-place5", protocol = 4
** job="routers-place5", protocol = 6
** job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

<pre>
[17:09:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]
</pre>

Query for IPv6 of all routers:

<pre>
[17:39:22] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#
</pre>

Query for all IPv6 uplinks:

<pre>
[17:40:26] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#
</pre>

Query for all IPv4 uplinks:

<pre>
[17:41:15] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]
</pre>

The values 8 and 12 mean:

* 8 = 4 (ip version 4) * probe_success (2 routers are up)
* 12 = 6 (ip version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (resp. 6):

<pre>
[17:41:38] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]
</pre>

However, if we are only interested in whether 0 probes are up, the normalisation does not matter, as 0*4 = 0 and 0*6 = 0.
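Put together, the alert described above could be written as a rule roughly like the following sketch. This is illustrative only, not the actual content of @uplink-monitoring.rules@: the alert name is made up, and @== bool@ is used here so that failed probes contribute a 0 to the sum instead of dropping out of the vector entirely.

<pre>
groups:
  - name: uplink-monitoring
    rules:
      - alert: UplinkIpv4Down   # illustrative name
        # counts successful IPv4 probes per job: the product of
        # probe_success (0/1) and probe_ip_protocol (4/6) is 4 only for a
        # successful IPv4 probe; "== bool 4" maps that to 1, everything
        # else to 0. Assumes probe_ip_protocol is still exported when a
        # probe fails.
        expr: sum by (job) ((probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol) == bool 4) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Total loss of IPv4 connectivity on {{ $labels.job }}"
</pre>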
h3. Using Grafana

* Username for changing items: "admin"
* Username for viewing dashboards: "ungleich"
* Passwords are in the password store

h3. Managing alerts

* Read https://prometheus.io/docs/practices/alerting/ as an introduction
* Use @amtool@

Showing current alerts:

<pre>
[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032
[14:54:41] monitoring.place6:~#
</pre>

Silencing alerts:

<pre>
[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine
[15:00:13] monitoring.place6:~#
</pre>

Better to also set the author and related fields when adding silences. TOBEFIXED

h3. Severity levels

The following notions are used:

* critical = panic = pages the whole team
* warning = something needs to be fixed = email to sre, non-paging
* info = not good, might be an indication that something needs fixing; goes to a matrix room
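A minimal sketch of how these levels could be mapped to receivers in @alertmanager.yml@; receiver names, addresses and URLs are illustrative assumptions, not our actual configuration:

<pre>
route:
  receiver: sre-email              # default: warning, non-paging email to sre
  routes:
    - match:
        severity: critical
      receiver: team-pager         # pages the whole team
    - match:
        severity: info
      receiver: matrix-room        # forwarded to a matrix room

receivers:
  - name: team-pager
    webhook_configs:
      - url: 'http://localhost:9095/page'    # illustrative paging hook
  - name: sre-email
    email_configs:
      - to: 'sre@example.com'                # illustrative; assumes SMTP settings in the global section
  - name: matrix-room
    webhook_configs:
      - url: 'http://localhost:9096/matrix'  # illustrative matrix bridge
</pre>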