Revision 6 (Nico Schottelius, 12/25/2023 12:30 PM) → Revision 7/8 (Nico Schottelius, 12/25/2023 01:47 PM)
h1. The ungleich monitoring infrastructure 2024 (WIP)

{{toc}}

h2. Intro

This is a work-in-progress update from [[The_ungleich_monitoring_infrastructure]]. The infrastructure is still based on prometheus + blackbox exporter, but now also makes use of kubernetes native objects.

h2. Monitoring definition

h3. External primary router/link monitoring

* Objective: find out from an external PoV whether the lines are functioning
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Targets
** ipv6/router1.place10/snr
** ipv4/router1.place10/snr
** ipv6/server12X.place10/snr
** ipv4/server12X.place10/snr
** ipv4/fiberstream/place5
** ipv4/fiberstream/place6
** ipv4/fiberstream/place7
** ipv4/fiberstream/place10
* Status: TBD

h3. Main DNS servers

* Objective: ensure all 3 DNS servers are running and returning queries
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Targets
** dns1.ungleich.ch
** dns2.ungleich.ch
** dns3.ungleich.ch
* Status: TBD

h3. External primary router

* Objective: find out whether a router is reachable via any path
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Status: TBD

h3. Test external monitoring

* Objective: find out whether the external monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on place10
* Targets
** ipv6/emonitor1.place12/prometheus
** ipv6/emonitor1.place12/blackbox
** ipv6/emonitor1.place12/alertmanager
** ipv6/vm1.place11/blackbox

h3. Test per place monitoring infrastructure (blackbox exporter, prometheus)

Each place should provide a blackbox exporter suitable for monitoring onsite targets. We need to ensure that these blackbox exporters all function and that prometheus instances are up.

* Objective: find out whether the onsite monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on place12
* Targets
** blackbox-exporter + prometheus/place5
** blackbox-exporter + prometheus/place6
** blackbox-exporter + prometheus/place10

h3. Internal router monitoring (TBD)

Monitor the internal routers in each place.

* Objective: find out whether the internal monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on ...
* Targets
** ipv6/apu-router1.place6 (via place10/blackbox)

h3. Internal network device monitoring

* Objective: find out whether all production switches are alive
* Implementation:
** Dedicated blackbox_exporter on a router or similar (needs to be secured)
* Targets
** All Arista in each place
** All Mikrotik in each place
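
Most definitions on this page follow the same pattern: one prometheus instance probes a list of targets through blackbox exporters in two different places. The following is a minimal sketch of that pattern, using the main DNS servers as targets. The exporter base URLs and the module name @dns_udp@ are assumptions made for illustration (modules are whatever the local blackbox configuration defines), and the helper functions are hypothetical. The @/probe@ endpoint with @target@ and @module@ parameters and the @probe_success@ metric are those of the blackbox exporter itself.

```python
from urllib.parse import urlencode

# Hypothetical base URLs: this page names the places, not the exporter addresses.
BLACKBOX_EXPORTERS = {
    "place11": "http://blackbox.place11.example:9115",
    "place12": "http://blackbox.place12.example:9115",
}

# Targets taken from the "Main DNS servers" section.
DNS_TARGETS = ["dns1.ungleich.ch", "dns2.ungleich.ch", "dns3.ungleich.ch"]


def probe_url(exporter_base, target, module):
    """Build the /probe URL that prometheus would scrape for one target."""
    query = urlencode({"target": target, "module": module})
    return f"{exporter_base}/probe?{query}"


def probe_succeeded(metrics_text):
    """Return True if the exporter response contains 'probe_success 1'."""
    for line in metrics_text.splitlines():
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    # No probe_success sample at all is treated as a failed probe.
    return False


def all_probe_urls(module="dns_udp"):
    """Pair every exporter with every target (module name is an assumption)."""
    return [
        probe_url(base, target, module)
        for base in BLACKBOX_EXPORTERS.values()
        for target in DNS_TARGETS
    ]
```

Probing each target from two places helps tell a failing target apart from a failing path: if only one exporter reports @probe_success 0@, the link from that place is the more likely culprit.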