Nico Schottelius, 07/05/2020 11:51 AM


The ungleich monitoring infrastructure

Introduction

We use the following tools for monitoring:

  • Consul (service discovery)
  • Prometheus (exporting, gathering, alerting)
  • Grafana (presenting)

Prometheus and Grafana run on the monitoring control machines:

  • monitoring.place5.ungleich.ch
  • monitoring.place6.ungleich.ch

Consul

We run one Consul cluster per datacenter (e.g. place5 and place6).
The Consul servers run on the physical machines (red{1..3} and black{1..3}, respectively), and the agents run on all other monitored machines (such as servers and VMs).

Each Consul agent is configured to publish the services its host provides (e.g. the exporters).
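
To illustrate what publishing an exporter through Consul can look like, here is a minimal Python sketch that builds a Consul service definition for a node exporter. The service name `node-exporter` and the check interval are assumptions made for this example, and port 9100 is simply the node exporter default; the definitions actually deployed on our hosts may differ.

```python
import json


def node_exporter_service(port=9100, interval="30s"):
    """Build a Consul service definition announcing a local node exporter.

    The service name and check interval are hypothetical example values.
    """
    return {
        "service": {
            "name": "node-exporter",  # hypothetical service name
            "port": port,
            "check": {
                # Consul marks the service unhealthy if this HTTP check fails.
                "http": f"http://localhost:{port}/metrics",
                "interval": interval,
            },
        }
    }


if __name__ == "__main__":
    # Print the definition as JSON, the format a Consul agent can load.
    print(json.dumps(node_exporter_service(), indent=2))
```

The resulting JSON could be placed in the agent's configuration directory and picked up with `consul reload`.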

The datacenters communicate with each other via WAN gossip [https://www.consul.io/docs/guides/datacenters.html].
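
Because the datacenters are joined over the WAN, a Consul agent in one datacenter can answer catalog queries about another one through the `dc` query parameter of the HTTP API. The following Python sketch lists the instances of a service in a given datacenter and turns them into `host:port` targets. The service name passed in, the local agent address `http://localhost:8500` (the Consul default) and the function names are assumptions for illustration.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def catalog_url(service, dc, agent="http://localhost:8500"):
    """URL of the Consul catalog endpoint for `service` in datacenter `dc`."""
    return f"{agent}/v1/catalog/service/{quote(service)}?{urlencode({'dc': dc})}"


def scrape_targets(entries):
    """Convert Consul catalog entries into host:port strings."""
    targets = []
    for entry in entries:
        # Fall back to the node address if the service has no own address.
        addr = entry.get("ServiceAddress") or entry["Address"]
        if ":" in addr:  # IPv6 literals need brackets before the port
            addr = f"[{addr}]"
        targets.append(f"{addr}:{entry['ServicePort']}")
    return targets


def fetch_targets(service, dc):
    """Query the local agent (needs a reachable Consul agent)."""
    with urlopen(catalog_url(service, dc), timeout=10) as resp:
        return scrape_targets(json.load(resp))
```

For example, `fetch_targets("node-exporter", "place6")` run in place5 would return the place6 instances, assuming a service with that name is registered there.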

Prometheus

Prometheus scrapes the data from the exporters on the monitored hosts and stores it. It also sends out alerts when needed (via Alertmanager).

Exporters

  • Node (host-specific metrics, e.g. CPU, RAM and disk usage)
  • Ceph (Ceph-specific metrics, e.g. pool usage, OSDs)
  • blackbox (metrics about the online state of HTTP/HTTPS services)

The node exporter runs on all monitored hosts.
The Ceph exporter is provided by Ceph itself and runs on the Ceph manager.
The blackbox exporter runs on the monitoring control machine itself.
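
Unlike the node exporter, the blackbox exporter is asked to probe a remote target on each scrape. As a rough sketch of that mechanism, the Python below builds the probe URL and reads the `probe_success` metric from the response text. The exporter address `http://localhost:9115` (the blackbox exporter default port) and the `http_2xx` module name (taken from the exporter's example configuration) are assumptions; our deployed modules may be named differently.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def probe_url(target, module="http_2xx", exporter="http://localhost:9115"):
    """URL that makes the blackbox exporter probe `target` with `module`."""
    return f"{exporter}/probe?{urlencode({'target': target, 'module': module})}"


def probe_succeeded(metrics_text):
    """Return True if the exporter reported probe_success 1."""
    for line in metrics_text.splitlines():
        # Skip "# HELP" / "# TYPE" comments and unrelated metrics.
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    return False


def probe(target):
    """Run one probe (needs a reachable blackbox exporter)."""
    with urlopen(probe_url(target), timeout=15) as resp:
        return probe_succeeded(resp.read().decode())
```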

Alerts

We configured the following alerts:

  • Ceph OSDs down
  • Ceph health state is not OK
  • Ceph quorum not OK
  • Ceph pool disk usage too high
  • Ceph disk usage too high
  • instance down
  • disk usage too high
  • monitored website down
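
The exact rule expressions live in the rule files on the monitoring machines and are not reproduced here. As a hedged illustration of how a "disk usage too high" condition is typically computed from node exporter metrics, the Python sketch below mirrors the arithmetic. The PromQL string and the 90 % threshold are assumptions, not our configured values.

```python
# A PromQL expression one might use for such an alert (assumed, not our rule):
DISK_USAGE_EXPR = (
    "(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90"
)


def disk_usage_percent(size_bytes, avail_bytes):
    """Percentage of the filesystem that is in use."""
    return (1 - avail_bytes / size_bytes) * 100


def disk_alert_fires(size_bytes, avail_bytes, threshold=90.0):
    """True if usage exceeds the (hypothetical) threshold in percent."""
    return disk_usage_percent(size_bytes, avail_bytes) > threshold
```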

Grafana

Grafana provides dashboards for the following:

  • Node (metrics about CPU, RAM, disk usage and so on)
  • blackbox (metrics about the blackbox exporter)
  • Ceph (important metrics from the Ceph exporter)

Authentication

Grafana authenticates users via LDAP (see The ungleich LDAP guide).
All users in the devops group are mapped to the Admin role; all other users are Viewers.
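
The mapping described above follows Grafana's "first matching group wins" logic, where a `*` entry acts as a catch-all. The Python sketch below only models that logic; it is not Grafana code, and the group DN shown is a made-up placeholder rather than our real LDAP layout.

```python
# Ordered list of (group DN, Grafana role); the DN is a hypothetical example.
MAPPINGS = [
    ("cn=devops,ou=groups,dc=example,dc=org", "Admin"),
    ("*", "Viewer"),  # catch-all for every other authenticated user
]


def grafana_role(user_groups, mappings=MAPPINGS):
    """Return the role of the first mapping that matches the user's groups."""
    for group_dn, role in mappings:
        if group_dn == "*" or group_dn in user_groups:
            return role
    return None  # no mapping matched


if __name__ == "__main__":
    print(grafana_role({"cn=devops,ou=groups,dc=example,dc=org"}))
    print(grafana_role({"cn=support,ou=groups,dc=example,dc=org"}))
```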

Monit

We use monit for monitoring and restarting daemons. See the `__ungleich_monit` type in dot-cdist.

Misc

  • You're probably looking for the `__dcl_monitoring_server` type, which centralizes a bunch of stuff.
  • This page needs some love!

Service/Customer monitoring

  • A few blackbox things can be found on the datacenter monitoring infrastructure.
  • There's a new Prometheus + Grafana setup at https://service-monitoring.ungleich.ch/, deployed by Timothée Floure for Matrix-as-a-Service monitoring. At the time of writing, it also monitors the VPN server and staticwebhosting. There is no Alertmanager yet, and the setup is partially manual.

Monitoring Guide

Configuring prometheus

Use `promtool check config` to verify the configuration:

[21:02:48] server1.place11:~# promtool check config /etc/prometheus/prometheus.yml 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 4 rule files found

Checking /etc/prometheus/blackbox.rules
  SUCCESS: 3 rules found

Checking /etc/prometheus/ceph-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/node-alerts.rules
  SUCCESS: 8 rules found

Checking /etc/prometheus/uplink-monitoring.rules
  SUCCESS: 1 rules found
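
If the check should run from a script, for example before reloading Prometheus, a small wrapper can help. The Python sketch below runs `promtool check config` and sums up the rule counts from output shaped like the transcript above. The function names are hypothetical, and the parsing assumes the output format stays as shown.

```python
import re
import subprocess


def check_config(path="/etc/prometheus/prometheus.yml"):
    """Run promtool and return (ok, combined output). Needs promtool in PATH."""
    proc = subprocess.run(
        ["promtool", "check", "config", path],
        capture_output=True,
        text=True,
    )
    # promtool exits with a non-zero status if the configuration is invalid.
    return proc.returncode == 0, proc.stdout + proc.stderr


def count_rules(output):
    """Sum the per-file 'N rules found' counts in promtool output."""
    # Does not match the 'N rule files found' summary line.
    return sum(int(n) for n in re.findall(r"SUCCESS: (\d+) rules? found", output))
```

Applied to the output above, `count_rules` yields 20 (3 + 8 + 8 + 1).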

Querying prometheus

Use `promtool query instant` to query values:

[21:00:26] server1.place11:~# promtool query instant http://localhost:9090 'probe_success{dc="place5"} == 1'
probe_success{dc="place5", instance="193.192.225.73", job="routers-place5", protocol="ipv4", sensiblehostname="router1"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="195.141.230.103", job="routers-place5", protocol="ipv4", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::12", job="routers-place5", protocol="ipv6", sensiblehostname="router2"} => 1 @[1593889492.577]
probe_success{dc="place5", instance="2001:1700:3500::2", job="routers-place5", protocol="ipv6", sensiblehostname="router1"} => 1 @[1593889492.577]
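
The same instant query can be sent to the Prometheus HTTP API (`/api/v1/query`), which is handy in scripts. The Python sketch below is a minimal example; the base URL matches the `promtool` call above, while the function names are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(expr, base="http://localhost:9090"):
    """URL of an instant query for the PromQL expression `expr`."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def instant_query(expr, base="http://localhost:9090"):
    """Run the query and return the result list (needs a reachable Prometheus)."""
    with urlopen(query_url(expr, base), timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return payload["data"]["result"]


def instance_values(result):
    """Reduce an instant-vector result to (instance, value) pairs."""
    # Each sample carries its labels in "metric" and [timestamp, "value"].
    return [(s["metric"].get("instance"), float(s["value"][1])) for s in result]
```

For example, `instance_values(instant_query('probe_success{dc="place5"} == 0'))` would list the place5 targets whose probe currently fails.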

Using Grafana

  • Username for changing items: "admin"
  • Username for viewing dashboards: "ungleich"
  • Passwords in the password store

Updated by Nico Schottelius over 4 years ago · 9 revisions