The ungleich monitoring infrastructure 2024 » History » Version 7

Nico Schottelius, 12/25/2023 01:47 PM

h1. The ungleich monitoring infrastructure 2024 (WIP)

{{toc}}

h2. Intro

This is a work-in-progress update of [[The_ungleich_monitoring_infrastructure]]. The infrastructure is still based on Prometheus and the blackbox exporter, but now also makes use of Kubernetes-native objects.

h2. Monitoring definition

h3. External primary router/link monitoring

* Objective: find out from an external point of view whether the lines are functioning
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Targets
** ipv6/router1.place10/snr
** ipv4/router1.place10/snr
** ipv6/server12X.place10/snr
** ipv4/server12X.place10/snr
** ipv4/fiberstream/place5
** ipv4/fiberstream/place6
** ipv4/fiberstream/place7
** ipv4/fiberstream/place10
* Status: TBD
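
The setup above probes every target from two vantage points, the blackbox exporters on place11 and place12. As a minimal sketch of how the resulting probe requests could be enumerated, the following Python snippet builds one blackbox exporter @/probe@ URL per exporter/target pair. The exporter hostnames, the target hostnames, and the @icmp@ module name are illustrative assumptions and are not taken from the ungleich configuration; port 9115 is the blackbox exporter's default listen port.

```python
from urllib.parse import urlencode

# Hypothetical exporter base URLs (9115 is the blackbox exporter default port).
EXPORTERS = {
    "place11": "http://blackbox.place11.example:9115",
    "place12": "http://blackbox.place12.example:9115",
}

# Hypothetical target hostnames; the real ones are not given on this page.
TARGETS = ["router1.place10.example", "fiberstream.place5.example"]


def probe_urls(exporters, targets, module="icmp"):
    """Return (place, target, url) for every exporter/target combination.

    `module` must name a module defined in the exporter's own config;
    "icmp" is only an assumed name here.
    """
    urls = []
    for place, base in sorted(exporters.items()):
        for target in targets:
            query = urlencode({"module": module, "target": target})
            urls.append((place, target, f"{base}/probe?{query}"))
    return urls


if __name__ == "__main__":
    for place, target, url in probe_urls(EXPORTERS, TARGETS):
        print(place, target, url)
```

Probing each target from both places makes it possible to tell a failing line apart from a failing vantage point: a target that is unreachable from only one exporter points at the exporter's side rather than at the line itself.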

h3. Main DNS servers

* Objective: ensure all 3 DNS servers are running and answering queries
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Targets
** dns1.ungleich.ch
** dns2.ungleich.ch
** dns3.ungleich.ch
* Status: TBD
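
To illustrate what "answering queries" means at the protocol level, here is a minimal Python sketch that builds a DNS query (RFC 1035 wire format) and checks that a response is a successful reply to it. In the planned setup this check would be done by the blackbox exporter's DNS prober, not by custom code; the function names, the query name @ungleich.ch@, and the query ID below are assumptions made for illustration only.

```python
import socket
import struct


def build_query(name, query_id, qtype=1):
    """Build a minimal DNS query for `name`; qtype 1 is an A record."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Question: encoded name, terminating zero, QTYPE, QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def is_valid_answer(response, query_id):
    """True if `response` is a reply to `query_id` with RCODE 0 and >= 1 answer."""
    if len(response) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_reply = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return rid == query_id and is_reply and rcode == 0 and ancount > 0


def server_answers(server, name, query_id=0x1234, timeout=3.0):
    """Send one UDP query to `server` and report whether it answered."""
    try:
        family, socktype, proto, _, addr = socket.getaddrinfo(
            server, 53, type=socket.SOCK_DGRAM
        )[0]
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(name, query_id), addr)
            response, _ = sock.recvfrom(4096)
    except OSError:  # covers resolution errors and timeouts
        return False
    return is_valid_answer(response, query_id)
```

A call such as @server_answers("dns1.ungleich.ch", "ungleich.ch")@ would then return @True@ only if that server sends back a well-formed, non-error reply with at least one answer record.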

h3. External primary router

* Objective: find out whether a router is reachable via any path
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Status: TBD

h3. Test external monitoring

* Objective: find out whether the external monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on place10
* Targets
** ipv6/emonitor1.place12/prometheus
** ipv6/emonitor1.place12/blackbox
** ipv6/emonitor1.place12/alertmanager
** ipv6/vm1.place11/blackbox
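
One way to read the target names above is as @address family/host/service@. Under that assumption, the following Python sketch maps each target to a health-check URL that the prometheus on place10 could scrape or probe. The interpretation of the naming scheme, the domain suffix @example.org@, and the use of the @/-/healthy@ path on the default ports (9090 for Prometheus, 9093 for Alertmanager, 9115 for the blackbox exporter) are assumptions, not details confirmed by this page.

```python
# Assumed default ports and health paths per service.
HEALTH_ENDPOINTS = {
    "prometheus": (9090, "/-/healthy"),
    "alertmanager": (9093, "/-/healthy"),
    "blackbox": (9115, "/-/healthy"),
}

TARGETS = [
    "ipv6/emonitor1.place12/prometheus",
    "ipv6/emonitor1.place12/blackbox",
    "ipv6/emonitor1.place12/alertmanager",
    "ipv6/vm1.place11/blackbox",
]


def parse_target(target):
    """Split 'family/host/service' into its three parts (assumed scheme)."""
    family, host, service = target.split("/")
    return family, host, service


def health_url(target, domain="example.org"):
    """Build a health-check URL; `domain` is a hypothetical suffix."""
    _family, host, service = parse_target(target)
    port, path = HEALTH_ENDPOINTS[service]
    return f"http://{host}.{domain}:{port}{path}"


if __name__ == "__main__":
    for target in TARGETS:
        print(target, "->", health_url(target))
```

Checking the external monitoring from place10 keeps the watcher independent of the systems it watches: if place12 goes down together with its own prometheus, the alert still has a place to originate from.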

h3. Test per place monitoring infrastructure (blackbox exporter, prometheus)

Each place should provide a blackbox exporter suitable for monitoring onsite targets. We need to ensure that these blackbox exporters all function and that prometheus instances are up.

* Objective: find out whether the onsite monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on place12
* Targets
** blackbox-exporter + prometheus/place5
** blackbox-exporter + prometheus/place6
** blackbox-exporter + prometheus/place10

h3. Internal router monitoring (TBD)

Monitor the internal routers of each place.

* Objective: find out whether the internal routers are alive
* Implementation:
** Collecting/alerting with prometheus on ...
* Targets
** ipv6/apu-router1.place6 (via place10/blackbox)

h3. Internal network device monitoring

* Objective: find out whether all production switches are alive
* Implementation:
** Dedicated blackbox_exporter on a router or similar (needs to be secured)
* Targets
** All Arista in each place
** All Mikrotik in each place