Revision 4 (Nico Schottelius, 12/18/2023 11:33 AM) → Revision 5/8 (Nico Schottelius, 12/21/2023 01:17 PM)
h1. The ungleich monitoring infrastructure 2024 (WIP)
{{toc}}
h2. Intro
This is a work-in-progress update of [[The_ungleich_monitoring_infrastructure]]. The infrastructure is still based on Prometheus and the blackbox exporter, but now also makes use of Kubernetes-native objects.
h2. Monitoring definition
h3. External primary router/link monitoring
* Objective: find out from an external PoV whether the lines are functioning
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Targets
** ipv6/router1.place10/snr
** ipv4/router1.place10/snr
** ipv6/server12X.place10/snr
** ipv4/server12X.place10/snr
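
The targets above appear to follow a @protocol/host/label@ notation. A minimal Python sketch, assuming that reading, of how each target could be expanded into one probe URL per external blackbox exporter. The @/probe@ endpoint with its @target@ and @module@ query parameters and port 9115 are blackbox exporter defaults; the exporter hostnames and the module names @icmp_ipv4@ and @icmp_ipv6@ are hypothetical and would have to exist in the exporter configuration.

```python
from typing import NamedTuple
from urllib.parse import urlencode


class Target(NamedTuple):
    protocol: str  # "ipv4" or "ipv6"
    host: str      # e.g. "router1.place10"
    label: str     # trailing field, e.g. "snr" (meaning not defined in this page)


# Hypothetical module names; they must be defined in the exporter's own config.
MODULES = {"ipv4": "icmp_ipv4", "ipv6": "icmp_ipv6"}

# Hypothetical addresses for the two external vantage points.
EXPORTERS = ["blackbox.place11.example:9115", "blackbox.place12.example:9115"]


def parse_target(spec: str) -> Target:
    """Split 'protocol/host/label' into its three fields."""
    protocol, host, label = spec.split("/")
    if protocol not in MODULES:
        raise ValueError(f"unknown protocol in {spec!r}")
    return Target(protocol, host, label)


def probe_urls(spec: str) -> list[str]:
    """Return one blackbox /probe URL per exporter for the given target."""
    target = parse_target(spec)
    query = urlencode({"target": target.host, "module": MODULES[target.protocol]})
    return [f"http://{exporter}/probe?{query}" for exporter in EXPORTERS]


if __name__ == "__main__":
    for url in probe_urls("ipv6/router1.place10/snr"):
        print(url)
```

Probing every target from both exporters means a failure seen from only one vantage point can be told apart from a failure of the line itself.
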
h3. External primary router
* Objective: find out whether a router is reachable via any path
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
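
"Reachable via any path" implies that a router only counts as down when every probe against it fails, regardless of exporter or protocol. A minimal Python sketch of that aggregation; the @ProbeResult@ type and its fields are hypothetical and only illustrate the rule. In Prometheus itself, the same idea would typically be expressed by aggregating the blackbox exporter's @probe_success@ metric across exporters.

```python
from typing import Iterable, NamedTuple


class ProbeResult(NamedTuple):
    vantage: str   # exporter location that ran the probe, e.g. "place11"
    protocol: str  # "ipv4" or "ipv6"
    success: bool  # True if the probe succeeded


def reachable_via_any_path(results: Iterable[ProbeResult]) -> bool:
    """A router counts as reachable if at least one probe succeeded."""
    return any(result.success for result in results)


if __name__ == "__main__":
    results = [
        ProbeResult("place11", "ipv6", False),
        ProbeResult("place12", "ipv4", True),
    ]
    print(reachable_via_any_path(results))
```
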
h3. Test external monitoring
* Objective: find out whether the external monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on place10
* Targets
** ipv6/emonitor1.place12/prometheus
** ipv6/emonitor1.place12/blackbox
** ipv6/emonitor1.place12/alertmanager
** ipv6/vm1.place11/blackbox
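
Here the last field of each target names a service rather than a link. A minimal Python sketch, assuming the services listen on their upstream default ports (Prometheus 9090, Alertmanager 9093, blackbox exporter 9115) and expose the @/-/healthy@ endpoint, of how the target list could be turned into health-check URLs. The short hostnames are taken from the list as-is and are assumed to resolve from place10; plain HTTP is also an assumption.

```python
# Upstream default ports; the actual deployment may differ.
DEFAULT_PORTS = {"prometheus": 9090, "blackbox": 9115, "alertmanager": 9093}

TARGETS = [
    "ipv6/emonitor1.place12/prometheus",
    "ipv6/emonitor1.place12/blackbox",
    "ipv6/emonitor1.place12/alertmanager",
    "ipv6/vm1.place11/blackbox",
]


def health_url(spec: str) -> str:
    """Build a health-check URL from a 'protocol/host/service' target."""
    protocol, host, service = spec.split("/")
    # The protocol field is not part of the URL; forcing IPv6 would be
    # done in the prober (for example via the blackbox module settings).
    port = DEFAULT_PORTS[service]
    return f"http://{host}:{port}/-/healthy"


if __name__ == "__main__":
    for spec in TARGETS:
        print(health_url(spec))
```
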
h3. Test per place monitoring infrastructure (blackbox exporter, prometheus)
Each place should provide a blackbox exporter suitable for monitoring onsite targets.
We need to ensure that all of these blackbox exporters are functioning and that the Prometheus instances are up.
* Objective: find out whether the onsite monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on place12
* Targets
** blackbox-exporter + prometheus/place5
** blackbox-exporter + prometheus/place6
** blackbox-exporter + prometheus/place10
h3. Internal router monitoring (TBD)
Monitor the internal routers of each place.
* Objective: find out whether the internal routers are alive
* Implementation:
** Collecting/alerting with prometheus on ...
* Targets
** ipv6/apu-router1.place6 (via place10/blackbox)
h3. Internal network device monitoring
* Objective: find out whether all production switches are alive
* Implementation:
** Dedicated blackbox_exporter on a router or similar (needs to be secured)
* Targets
** All Arista in each place
** All Mikrotik in each place