Nico Schottelius, 07/24/2020 11:44 AM

The ungleich monitoring infrastructure¶

Introduction¶

We use the following tools for monitoring:

consul (service discovery)
prometheus (exporting, gathering, alerting)
Grafana (presenting)

Prometheus and Grafana run on the monitoring control machines:

monitoring.place5.ungleich.ch
monitoring.place6.ungleich.ch

Consul¶

We use one consul cluster per datacenter (e.g. place5 and place6).
The servers run on the physical machines (red{1..3} and black{1..3}, respectively) and the agents run on all other monitored machines (such as servers and VMs).

Consul is configured to publish the services its host provides (e.g. the exporters).

There is inter-datacenter communication (WAN gossip), see https://www.consul.io/docs/guides/datacenters.html

Prometheus¶

Prometheus is responsible for collecting all data from the monitored hosts (via the exporters) and storing it. It also sends out alerts when needed (via alertmanager).

Exporters¶

Node (host-specific metrics, e.g. CPU, RAM and disk usage)
Ceph (Ceph-specific metrics, e.g. pool usage, OSDs)
blackbox (metrics about the online state of http/https services)

The node exporter runs on all monitored hosts.
The Ceph exporter is provided by ceph itself and runs on the ceph manager.
The blackbox exporter runs on the monitoring control machine itself.

Alerts¶

We configured the following alerts:

ceph osds down
ceph health state is not OK
ceph quorum not OK
ceph pool disk usage too high
ceph disk usage too high
instance down
disk usage too high
Monitored website down

Grafana¶

Grafana provides dashboards for the following:

Node (metrics about CPU, RAM, disk usage and so on)
blackbox (metrics about the blackbox exporter)
ceph (important metrics from the ceph exporter)

Authentication¶

Grafana authenticates against LDAP. (See The ungleich LDAP guide)
All users in the devops group are mapped to the Admin role, all other users are Viewers.

Monit¶

We use monit for monitoring and restarting daemons. See `__ungleich_monit` type in dot-cdist.

Misc¶

You're probably looking for the `__dcl_monitoring_server` type, which centralizes a bunch of stuff.
This page needs some love!

Service/Customer monitoring¶

A few blackbox things can be found on the datacenter monitoring infrastructure.
There's a new prometheus+grafana setup at https://service-monitoring.ungleich.ch/, deployed by @Timothée Floure for Matrix-as-a-Service monitoring. At time of writing, it also monitors the VPN server and staticwebhosting. No alertmanager yet. Partially manual.

Monitoring Guide¶

Configuring prometheus¶

Use promtool check config to verify the configuration.

[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found

Querying prometheus¶

Use promtool query instant to query values:

[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
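The same instant query can also be issued without promtool. The following is a minimal sketch, assuming the standard Prometheus HTTP API endpoint /api/v1/query and its JSON instant-vector response format. The helper names are hypothetical, and the sample response is hand-written for illustration (modelled on the output above), not captured from a real server.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error')}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(r["metric"], float(r["value"][1])) for r in doc["data"]["result"]]


# Hand-written sample response (assumption), modelled on the promtool output.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "probe_success", "dc": "place5",
                        "sensiblehostname": "router1"},
             "value": [1593889492.577, "1"]},
        ],
    },
})

url = instant_query_url("http://localhost:9090", 'probe_success{dc="place5"} == 1')
series = parse_instant_vector(SAMPLE)
```

Fetching the URL (for instance with urllib.request.urlopen) is left out on purpose so that the sketch runs without a Prometheus server.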

Typical queries:

Creating a sum of all metrics that contain a common label. For instance, summing over all jobs:

sum by (job) (probe_success)

[17:07:58] server1.place11:/etc/prometheus# promtool query instant http://localhost:9090 'sum by (job) (probe_success)'
{job="routers-place5"} => 4 @[1593961699.969]
{job="uplink-place5"} => 4 @[1593961699.969]
{job="routers-place6'"} => 4 @[1593961699.969]
{job="uplink-place6"} => 4 @[1593961699.969]
{job="core-services"} => 3 @[1593961699.969]
[17:08:19] server1.place11:/etc/prometheus#

Combining different metrics for filtering. For instance, to select all probe_success series that also have a probe_ip_protocol metric with the value 4:

probe_ip_protocol{dc="place5", instance="147.78.195.249", job="routers-place5", protocol="ipv4"} => 4 @[1593961766.619]

The on operator matches the series of both metrics by their instance label; the comparison then filters the result:

sum(probe_success * on(instance) probe_ip_protocol == 4)

Creating an alert:

If the sum over all jobs matching a certain regex and a given IP protocol is 0, this particular job indicates total loss of connectivity.
We want to get a vector like this:
- job="routers-place5", protocol = 4
- job="uplink-place5", protocol = 4
- job="routers-place5", protocol = 6
- job="uplink-place5", protocol = 6

Query for IPv4 of all routers:

[17:09:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="routers-place5"} => 8 @[1593963562.281]
{job="routers-place6'"} => 8 @[1593963562.281]

Query for IPv6 of all routers:

[17:39:22] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"routers-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="routers-place5"} => 12 @[1593963626.483]
{job="routers-place6'"} => 12 @[1593963626.483]
[17:40:26] server1.place11:/etc/prometheus#

Query for all IPv6 uplinks:

[17:40:26] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job)'
{job="uplink-place5"} => 12 @[1593963675.835]
{job="uplink-place6"} => 12 @[1593963675.835]
[17:41:15] server1.place11:/etc/prometheus#

Query for all IPv4 uplinks:

[17:41:15] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job)'
{job="uplink-place5"} => 8 @[1593963698.108]
{job="uplink-place6"} => 8 @[1593963698.108]

The values 8 and 12 mean:

8 = 4 (IP version 4) * probe_success (2 routers are up)
12 = 6 (IP version 6) * probe_success (2 routers are up)

To normalise, we would need to divide by 4 (or 6):

[17:41:38] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 4) by (job) / 4'
{job="uplink-place5"} => 2 @[1593963778.885]
{job="uplink-place6"} => 2 @[1593963778.885]
[17:42:58] server1.place11:/etc/prometheus# promtool  query instant http://localhost:9090 'sum(probe_success{job=~"uplink-.*"} * on(instance) group_left(job) probe_ip_protocol == 6) by (job) / 6'
{job="uplink-place5"} => 2 @[1593963788.276]
{job="uplink-place6"} => 2 @[1593963788.276]

However if we are only interested in whether 0 are up, it does not matter as 0*4 = 0 and 0*6 = 0.
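The arithmetic behind these queries can be replayed offline. The following is a minimal sketch, not the real PromQL engine: it models sum(probe_success * on(instance) probe_ip_protocol == N) by (job) on hand-written samples, assuming that "*" binds tighter than "==" so that the comparison filters the product. The instances are taken from the router output earlier on this page; the function names are hypothetical.

```python
from collections import defaultdict

# (job, instance, probe_success, probe_ip_protocol), hand-written samples.
SAMPLES = [
    ("routers-place5", "193.192.225.73", 1, 4),
    ("routers-place5", "195.141.230.103", 1, 4),
    ("routers-place5", "2001:1700:3500::12", 1, 6),
    ("routers-place5", "2001:1700:3500::2", 1, 6),
]


def sum_by_job(samples, protocol):
    """Model sum(probe_success * on(instance) probe_ip_protocol == N) by (job)."""
    totals = defaultdict(int)
    for job, _instance, success, ip_protocol in samples:
        product = success * ip_protocol
        # Assumption: the comparison applies to the product and drops
        # every series whose product is not equal to the protocol number.
        if product == protocol:
            totals[job] += product
    return dict(totals)


def targets_up(samples, protocol):
    """Normalise the sum by the protocol number to count the targets that are up."""
    return {job: total // protocol
            for job, total in sum_by_job(samples, protocol).items()}


ipv4_sum = sum_by_job(SAMPLES, 4)
ipv6_sum = sum_by_job(SAMPLES, 6)
# Same targets, but every probe failing.
all_down = sum_by_job([(j, i, 0, p) for j, i, _s, p in SAMPLES], 4)
```

In this model the IPv4 sum is 8, the IPv6 sum is 12, and both normalise to 2. Note that all_down is empty rather than 0: failed probes have a product of 0 and are dropped by the comparison. It is worth verifying against a live Prometheus whether an alert that waits for the sum to become 0 behaves the same way.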

Using Grafana¶

Username for changing items: "admin"
Username for viewing dashboards: "ungleich"
Passwords in the password store

Managing alerts¶

Read https://prometheus.io/docs/practices/alerting/ as an introduction
Use amtool

Showing current alerts:

# Alpine needs URL (why?)
amtool alert query --alertmanager.url=http://localhost:9093

# Debian
amtool alert query

[14:54:35] monitoring.place6:~# amtool alert query
Alertname            Starts At                 Summary                                                               
InstanceDown         2020-07-01 10:24:03 CEST  Instance red1.place5.ungleich.ch down                                 
InstanceDown         2020-07-01 10:24:03 CEST  Instance red3.place5.ungleich.ch down                                 
InstanceDown         2020-07-05 12:51:03 CEST  Instance apu-router2.place5.ungleich.ch down                          
UngleichServiceDown  2020-07-05 13:51:19 CEST  Ungleich internal service https://staging.swiss-crowdfunder.com down  
InstanceDown         2020-07-05 13:55:33 CEST  Instance https://swiss-crowdfunder.com down                           
CephHealthSate       2020-07-05 13:59:49 CEST  Ceph Cluster is not healthy.                                          
LinthalHigh          2020-07-05 14:01:41 CEST  Temperature on risinghf-19 is 32.10012512207032                       
[14:54:41] monitoring.place6:~#

Silencing alerts:

[14:59:45] monitoring.place6:~# amtool silence add -c "Ceph is actually fine" alertname=CephHealthSate
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa
[15:00:06] monitoring.place6:~# amtool silence query
ID                                    Matchers                  Ends At                  Created By  Comment                
4a5c65ff-4af3-4dc9-a6e0-5754b00cd2fa  alertname=CephHealthSate  2020-07-05 14:00:06 UTC  root        Ceph is actually fine  
[15:00:13] monitoring.place6:~#

TODO: silences should also set the author and related fields instead of relying on the defaults.
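One way to set author, comment and duration explicitly is to create the silence through the Alertmanager HTTP API instead of amtool. The following is a minimal sketch that only builds the request body, assuming the v2 API (POST /api/v2/silences) and its field names. The author "sre-on-call" and the helper name are hypothetical; `amtool silence add --help` lists the equivalent command line options.

```python
import json
from datetime import datetime, timedelta, timezone


def silence_payload(alertname, author, comment, hours=1, now=None):
    """Build the JSON body for a silence matching a single alert name."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }


# Fixed start time so the result is reproducible; it mirrors the example above.
payload = silence_payload(
    "CephHealthSate", "sre-on-call", "Ceph is actually fine",
    now=datetime(2020, 7, 5, 13, 0, 6, tzinfo=timezone.utc),
)
body = json.dumps(payload)
```

Sending body to the Alertmanager is left out so that the sketch runs without a server.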

Severity levels¶

The following notions are used:

critical = panic = calls the whole team
warning = something needs to be fixed = email to SRE, non-paging
info = not good, might indicate that something needs fixing, goes to a Matrix room
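The three levels can be read as a small routing table. The following is a minimal sketch under stated assumptions: alerts carry a label named "severity", only critical pages anyone, and the channel descriptions are placeholders. None of this is the actual alertmanager configuration.

```python
# Hypothetical routing table derived from the three levels above.
ROUTES = {
    "critical": {"channel": "call the whole team", "paging": True},
    "warning": {"channel": "email to SRE", "paging": False},
    "info": {"channel": "Matrix room", "paging": False},
}


def route(labels):
    """Return the route for an alert based on its (assumed) severity label."""
    severity = labels.get("severity")
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    return ROUTES[severity]


critical = route({"alertname": "InstanceDown", "severity": "critical"})
warning = route({"alertname": "DiskUsageHigh", "severity": "warning"})
```

Rejecting unknown severities is a design choice of the sketch, so that a mistyped label cannot silently end up unrouted.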

Labeling¶

Labeling in Prometheus is a science of its own and has a lot of pitfalls. Let's start with some:

The relabel_configs are applied BEFORE scraping
The metric_relabel_configs are applied AFTER scraping (contains different labels!)
Regular expressions are not the "default" RE, but RE2
metric_relabel_configs does not apply to automatic labels like up!
- You need to use relabel_configs

Setting "roles"¶

We use the label "role" to define a primary purpose per host. Example from 2020-07-07:

    relabel_configs:
      - source_labels: [__address__]
        regex:         '.*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*'
        target_label:  'role'
        replacement:   '$1'
      - source_labels: [__address__]
        regex:         'ciara.*.ungleich.ch.*'
        target_label:  'role'
        replacement:   'server'
      - source_labels: [__address__]
        regex:         '.*:9283'
        target_label:  'role'
        replacement:   'ceph'
      - source_labels: [__address__]
        regex:         '((ciara2|ciara4).*)'
        target_label:  'role'
        replacement:   'down'
      - source_labels: [__address__]
        regex:         '.*(place.*).ungleich.ch.*'
        target_label:  'dc'
        replacement:   '$1'

What happens here:

__address__ contains the hostname+port, e.g. server1.placeX.ungleich.ch:9100
We apply some roles by default (server, monitor etc.)
There is a special rule for ciara, which does not match the serverX pattern
ciara2 and ciara4 in the above example are intentionally down
At the end we set up the "dc" label if the host is in a place (datacenter) of ungleich
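The effect of the rules above can be replayed offline. The following is a rough sketch, not Prometheus itself: it assumes that rules run in order, that each regex must match the whole __address__, and that a non-matching rule leaves the labels untouched. It uses Python's re module as a stand-in for RE2, which is an approximation. The function name is hypothetical.

```python
import re

# (regex, target_label, replacement), copied in order from the config above.
RULES = [
    (r".*(server|monitor|canary-vm|vpn|server|apu-router|router).*.ungleich.ch.*",
     "role", "$1"),
    (r"ciara.*.ungleich.ch.*", "role", "server"),
    (r".*:9283", "role", "ceph"),
    (r"((ciara2|ciara4).*)", "role", "down"),
    (r".*(place.*).ungleich.ch.*", "dc", "$1"),
]


def relabel(address):
    """Apply the rules to one __address__ value and return the resulting labels."""
    labels = {}
    for pattern, target, replacement in RULES:
        # Assumption: the regex is anchored at both ends of the address.
        match = re.fullmatch(pattern, address)
        if match is None:
            continue
        # Translate "$1" style references into Python's "\g<1>" syntax.
        template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
        labels[target] = match.expand(template)
    return labels


server = relabel("server1.place5.ungleich.ch:9100")
ciara_down = relabel("ciara2.place6.ungleich.ch:9100")
ceph = relabel("server2.place6.ungleich.ch:9283")
apu = relabel("apu-router2.place5.ungleich.ch:9100")
```

Later rules overwrite earlier ones, which is how ciara2 first becomes a server and then ends up as down. In this sketch apu-router2.place5.ungleich.ch:9100 gets the role router, not apu-router, because the greedy leading .* leaves only "router" for the capture group; it is worth checking against the real targets whether Prometheus behaves the same way.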

Marking hosts down¶

If a host or service is intentionally down, change its role to down.

SMS and Voice notifications¶

We use https://ecall.ch.

For voice: mail to number@voice.ecall.ch
For SMS: mail to number@sms.ecall.ch

Uses email sender based authorization.
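The mail-to-SMS and mail-to-voice addressing can be illustrated with a message builder. The following is a minimal sketch that only composes the mail and does not send it. The sender address, the phone number and its format, the subject line, and the assumption that the notification text goes into the mail body are all hypothetical; the source only specifies the recipient address pattern and the sender-based authorization.

```python
from email.message import EmailMessage


def ecall_message(sender, number, text, channel="sms"):
    """Compose a mail addressed to the ecall SMS or voice gateway."""
    if channel not in ("sms", "voice"):
        raise ValueError(f"unknown channel: {channel!r}")
    msg = EmailMessage()
    # The sender matters: ecall authorizes by sender address.
    msg["From"] = sender
    msg["To"] = f"{number}@{channel}.ecall.ch"
    msg["Subject"] = "Alert"  # placeholder subject (assumption)
    msg.set_content(text)
    return msg


# Hypothetical sender and number, for illustration only.
sms = ecall_message("alerts@example.org", "0041790000000",
                    "Instance red1.place5.ungleich.ch down")
voice = ecall_message("alerts@example.org", "0041790000000",
                      "Ceph Cluster is not healthy.", channel="voice")
```

The resulting message could be handed to smtplib for delivery from an authorized sender address.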

Alertmanager clusters¶

The outside monitors form one alertmanager cluster
The inside monitors form one alertmanager cluster

Updated by Nico Schottelius about 5 years ago · 19 revisions
