Nico Schottelius, 12/21/2023 01:17 PM
h1. The ungleich monitoring infrastructure 2024 (WIP)

{{toc}}

h2. Intro

This is a work-in-progress update of [[The_ungleich_monitoring_infrastructure]]. The infrastructure is still based on Prometheus and the blackbox exporter, but now also makes use of Kubernetes-native objects.

h2. Monitoring definition

h3. External primary router/link monitoring

* Objective: find out from an external point of view whether the lines are functioning
* Implementation:
** Collecting/alerting with Prometheus on place12
** blackbox exporter on place12
** blackbox exporter on place11
* Targets:
** ipv6/router1.place10/snr
** ipv4/router1.place10/snr
** ipv6/server12X.place10/snr
** ipv4/server12X.place10/snr
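
The target list above could be expanded into blackbox exporter probe requests roughly as follows. This is only a sketch: the exporter host names and the module names @icmp4@/@icmp6@ are made up for illustration, port 9115 is the blackbox exporter default, and reading each entry as @<ip family>/<host>/<check>@ is an assumption about the notation used on this page.

```python
from urllib.parse import urlencode

# Hypothetical exporter addresses; the real host names are not given on this page.
EXPORTERS = ["blackbox.place12.example:9115", "blackbox.place11.example:9115"]

TARGETS = [
    "ipv6/router1.place10/snr",
    "ipv4/router1.place10/snr",
    "ipv6/server12X.place10/snr",
    "ipv4/server12X.place10/snr",
]

# Assumed mapping from IP family to a blackbox module name. Modules are
# defined in the exporter's own configuration; these names are invented.
MODULES = {"ipv4": "icmp4", "ipv6": "icmp6"}


def parse_target(entry):
    """Split 'family/host/check' into its three parts (assumed notation)."""
    family, host, check = entry.split("/")
    if family not in MODULES:
        raise ValueError(f"unknown IP family: {family}")
    return family, host, check


def probe_urls(entry):
    """Return one /probe URL per exporter for a single target entry."""
    family, host, _check = parse_target(entry)
    query = urlencode({"target": host, "module": MODULES[family]})
    return [f"http://{exporter}/probe?{query}" for exporter in EXPORTERS]


if __name__ == "__main__":
    for entry in TARGETS:
        for url in probe_urls(entry):
            print(url)
```

Probing every target from both exporters yields one result per path, which is what allows a failing line to be told apart from a failing router.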

h3. External primary router

* Objective: find out whether a router is reachable via any path
* Implementation:
** Collecting/alerting with Prometheus on place12
** blackbox exporter on place12
** blackbox exporter on place11
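
"Reachable via any path" presumably means that the router only counts as down when the probes from every blackbox exporter fail. A minimal sketch of that evaluation, assuming one boolean result per exporter (mirroring the 0/1 value of the exporter's @probe_success@ metric); the function names and the sample data are hypothetical:

```python
def reachable_via_any_path(probe_results):
    """True if at least one exporter reached the router.

    probe_results maps an exporter name to the outcome of its probe
    (True = success), mirroring the 0/1 value of probe_success.
    """
    if not probe_results:
        raise ValueError("no probe results to evaluate")
    return any(probe_results.values())


def failed_paths(probe_results):
    """List the exporters whose probe failed, for per-path diagnosis."""
    return sorted(name for name, ok in probe_results.items() if not ok)


if __name__ == "__main__":
    # Made-up sample: the probe from place12 fails, the one from place11 works.
    sample = {"blackbox.place12": False, "blackbox.place11": True}
    print(reachable_via_any_path(sample))
    print(failed_paths(sample))
```

In this reading, a single failing path is a link problem, while a failure on all paths points at the router itself.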

h3. Test external monitoring

* Objective: find out whether the external monitoring is alive
* Implementation:
** Collecting/alerting with Prometheus on place10
* Targets:
** ipv6/emonitor1.place12/prometheus
** ipv6/emonitor1.place12/blackbox
** ipv6/emonitor1.place12/alertmanager
** ipv6/vm1.place11/blackbox

h3. Test per-place monitoring infrastructure (blackbox exporter, Prometheus)

Each place should provide a blackbox exporter suitable for monitoring onsite targets. We need to ensure that all of these blackbox exporters work and that the Prometheus instances are up.

* Objective: find out whether the onsite monitoring is alive
* Implementation:
** Collecting/alerting with Prometheus on place12
* Targets:
** blackbox-exporter + prometheus/place5
** blackbox-exporter + prometheus/place6
** blackbox-exporter + prometheus/place10
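
Taken together with the external monitoring test, the monitoring components of every place are watched from a different place. The following sketch checks that property; the @WATCHES@ table is read off the two "Test ..." sections of this page, and the helper name is hypothetical:

```python
# Which Prometheus place watches the monitoring components of which places,
# as read from the two "Test ... monitoring" sections of this page.
WATCHES = {
    "place10": {"place12", "place11"},
    "place12": {"place5", "place6", "place10"},
}


def unwatched_places(watches):
    """Return places that appear in the table but are watched by no other place."""
    all_places = set(watches)
    for watched in watches.values():
        all_places |= watched
    covered = set()
    for watcher, watched in watches.items():
        covered |= watched - {watcher}  # self-monitoring does not count
    return sorted(all_places - covered)


if __name__ == "__main__":
    print(unwatched_places(WATCHES))
```

An empty result means that no monitoring component can fail silently; a check like this could be rerun whenever a place is added.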

h3. Internal router monitoring (TBD)

Monitor the internal routers of each place.

* Objective: find out whether the internal routers are alive
* Implementation:
** Collecting/alerting with Prometheus on ...
* Targets:
** ipv6/apu-router1.place6 (via place10/blackbox)

h3. Internal network device monitoring

* Objective: find out whether all production switches are alive
* Implementation:
** Dedicated blackbox_exporter on a router or similar (needs to be secured)
* Targets:
** All Arista devices in each place
** All MikroTik devices in each place
| 74 | ** All Mikrotik in each place |