The ungleich monitoring infrastructure 2024 » History » Version 8

Nico Schottelius, 12/25/2023 01:51 PM

h1. The ungleich monitoring infrastructure 2024 (WIP)

{{toc}}

h2. Intro

This is a work-in-progress update of [[The_ungleich_monitoring_infrastructure]]. The infrastructure is still based on Prometheus and the blackbox exporter, but now also uses Kubernetes-native objects.

h2. Monitoring definition

h3. External primary router/link monitoring

* Objective: find out from an external point of view whether the lines are functioning
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Targets
** ipv6/router1.place10/snr
** ipv4/router1.place10/snr
** ipv6/server12X.place10/snr
** ipv4/server12X.place10/snr
** ipv4/fiberstream/place5
** ipv4/fiberstream/place6
** ipv4/fiberstream/place7
** ipv4/fiberstream/place10
* Status: TBD
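
The setup above probes every target from two blackbox exporters. A minimal Python sketch of how the resulting probe requests could be enumerated is shown here. The exporter hostnames, the target addresses and the module name @icmp@ are assumptions for illustration only; the page does not specify them. The @/probe@ endpoint with its @module@ and @target@ parameters and the default port 9115 are those of the upstream blackbox exporter.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical exporter addresses: the page only names place11 and place12.
# 9115 is the default port of the blackbox exporter.
EXPORTERS = [
    "blackbox.place11.example.org:9115",
    "blackbox.place12.example.org:9115",
]

# Hypothetical addresses standing in for labels such as ipv6/router1.place10/snr.
TARGETS = {
    "ipv6/router1.place10/snr": "2001:db8::1",
    "ipv4/router1.place10/snr": "192.0.2.1",
}


def probe_url(exporter: str, target: str, module: str = "icmp") -> str:
    """Build the URL that asks one blackbox exporter to probe one target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


def all_probe_urls(exporters, targets, module="icmp"):
    """Return {(exporter, label): url} so each target is probed from each exporter."""
    return {
        (exporter, label): probe_url(exporter, address, module)
        for exporter, (label, address) in product(exporters, targets.items())
    }
```

Probing each target from both places helps to tell a failing line apart from a failing vantage point.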

h3. Main DNS servers

* Objective: ensure all three DNS servers are running and answering queries
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Targets
** dns1.ungleich.ch
** dns2.ungleich.ch
** dns3.ungleich.ch
* Status: TBD
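
To illustrate what "answering queries" means on the wire, here is a rough Python sketch that encodes a DNS query and checks the header of a reply. It is not how the blackbox exporter is implemented. The function names, the transaction ID and the choice of record type are assumptions made for this example.

```python
import struct


def build_dns_query(name: str, txid: int = 0x1234, qtype: int = 1) -> bytes:
    """Encode a single-question DNS query (qtype 1 = A record, class IN)."""
    # Header: id, flags (only "recursion desired" set), one question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def is_good_answer(query: bytes, response: bytes) -> bool:
    """True if the reply matches the query ID, has RCODE 0 and at least one answer."""
    if len(response) < 12:
        return False
    txid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_reply = bool(flags & 0x8000)  # QR bit
    rcode = flags & 0x000F           # 0 means "no error"
    query_id = struct.unpack("!H", query[:2])[0]
    return txid == query_id and is_reply and rcode == 0 and ancount > 0
```

A server that is running but returns SERVFAIL or an empty answer section fails this check, which is the distinction the objective asks for.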

h3. External primary router

* Objective: find out whether a router is reachable via any path
* Implementation:
** Collecting/alerting with prometheus on place12
** blackbox on place12
** blackbox on place11
* Targets
** genauso/r2
** genauso/r3
** p5/server137
** p5/server138
** p10/router1
** p10/server122
** p10/server123
** p15/server120
** p15/server121
* Status: TBD
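
"Reachable via any path" can be read as: a router counts as up if at least one probe towards it succeeds, regardless of the exporter or address family. The Python sketch below shows that reduction under this reading. The tuple layout and the path labels are hypothetical.

```python
from collections import defaultdict


def reachable_via_any_path(results):
    """Collapse per-path probe results into one verdict per target.

    `results` is an iterable of (target, path, success) tuples, where `path`
    is a hypothetical label such as "place11/ipv6".
    """
    verdict = defaultdict(bool)
    for target, _path, success in results:
        # One successful path is enough to mark the target as reachable.
        verdict[target] = verdict[target] or bool(success)
    return dict(verdict)
```

In PromQL the same idea would roughly correspond to taking the maximum of @probe_success@ per target, assuming the probes carry a common label that identifies the target.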

h3. Test external monitoring

* Objective: find out whether the external monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on place10
* Targets
** ipv6/emonitor1.place12/prometheus
** ipv6/emonitor1.place12/blackbox
** ipv6/emonitor1.place12/alertmanager
** ipv6/vm1.place11/blackbox
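
One way to check that these components are alive is to request their health endpoints. The Python sketch below only builds the URLs. It assumes the upstream default ports and the @/-/healthy@ path of each project; the actual deployment may use different ports, a route prefix or TLS, and the host names are placeholders.

```python
# Assumed upstream defaults; a deployment may differ.
COMPONENTS = {
    "prometheus": (9090, "/-/healthy"),
    "alertmanager": (9093, "/-/healthy"),
    "blackbox": (9115, "/-/healthy"),
}


def health_url(host: str, component: str) -> str:
    """Build the health-check URL for one component on one host."""
    port, path = COMPONENTS[component]
    # Literal IPv6 addresses must be bracketed inside a URL.
    if ":" in host:
        host = f"[{host}]"
    return f"http://{host}:{port}{path}"
```

A blackbox HTTP probe against such a URL, run from place10, would then show whether the external monitoring host itself still responds.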

h3. Test per-place monitoring infrastructure (blackbox exporter, prometheus)

Each place should provide a blackbox exporter suitable for monitoring onsite targets.
We need to ensure that all of these blackbox exporters function and that the prometheus instances are up.

* Objective: find out whether the onsite monitoring is alive
* Implementation:
** Collecting/alerting with prometheus on place12
* Targets
** blackbox-exporter + prometheus/place5
** blackbox-exporter + prometheus/place6
** blackbox-exporter + prometheus/place10
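
A blackbox exporter answers each probe request with metrics in the Prometheus text format, including @probe_success@. The following Python sketch shows how such a response could be evaluated by hand, for example when debugging an onsite exporter. The helper names are made up for this example, and the parser only handles metrics without labels.

```python
def metric_value(exposition: str, metric: str):
    """Return the first sample value of an unlabelled metric, or None if absent."""
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == metric:
            return float(parts[1])
    return None


def probe_succeeded(exposition: str) -> bool:
    """True only if the response contains probe_success with the value 1."""
    return metric_value(exposition, "probe_success") == 1.0
```

A missing @probe_success@ line is treated as a failure, so an exporter that returns an error page is not mistaken for a healthy one.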

h3. Internal router monitoring (TBD)

Monitor the internal routers of each place.

* Objective: find out whether the internal routers are alive
* Implementation:
** Collecting/alerting with prometheus on ...
* Targets
** ipv6/apu-router1.place6 (via place10/blackbox)

h3. Internal network device monitoring

* Objective: find out whether all production switches are alive
* Implementation:
** Dedicated blackbox_exporter on a router or similar (needs to be secured)
* Targets
** All Arista in each place
** All Mikrotik in each place