Revision 6 (Nico Schottelius, 12/25/2023 12:30 PM) → Revision 7/8 (Nico Schottelius, 12/25/2023 01:47 PM)

h1. The ungleich monitoring infrastructure 2024 (WIP) 

 {{toc}} 

 h2. Intro 

 This is a work-in-progress update of [[The_ungleich_monitoring_infrastructure]]. The infrastructure is still based on prometheus and the blackbox exporter, but now also makes use of kubernetes-native objects. 

 h2. Monitoring definition 

 h3. External primary router/link monitoring 

 * Objective: find out from an external point of view whether the lines are functioning 
 * Implementation: 
 ** Collecting/alerting with prometheus on place12 
 ** blackbox on place12 
 ** blackbox on place11 
 * Targets 
 ** ipv6/router1.place10/snr 
 ** ipv4/router1.place10/snr 
 ** ipv6/server12X.place10/snr 
 ** ipv4/server12X.place10/snr 
 ** ipv4/fiberstream/place5 
 ** ipv4/fiberstream/place6 
 ** ipv4/fiberstream/place7 
 ** ipv4/fiberstream/place10 
 * Status: TBD 
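
 The setup above probes every target from two blackbox exporters. The following is a minimal sketch of how the resulting probe matrix could be enumerated. It assumes the exporters are reachable under the hypothetical base URLs shown (the source only names the places), that an @icmp@ module is configured on them, and that the target labels from the list stand in for the real addresses.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical base URLs: the page only states "blackbox on place12/place11".
EXPORTERS = {
    "place12": "http://blackbox.place12.example:9115",
    "place11": "http://blackbox.place11.example:9115",
}

# Labels copied from the target list; real hostnames or addresses would go here.
TARGETS = [
    "ipv6/router1.place10/snr",
    "ipv4/router1.place10/snr",
]


def probe_urls(exporters, targets, module="icmp"):
    """Return (place, target, url) for every exporter/target combination.

    The URL uses the blackbox exporter's /probe endpoint with its
    `target` and `module` query parameters. The module name is an
    assumption and must exist in the exporter's configuration.
    """
    urls = []
    for (place, base), target in product(exporters.items(), targets):
        query = urlencode({"target": target, "module": module})
        urls.append((place, target, f"{base}/probe?{query}"))
    return urls


if __name__ == "__main__":
    for place, target, url in probe_urls(EXPORTERS, TARGETS):
        print(place, target, url)
```

 Probing from two places makes it possible to tell a failing line apart from a failing exporter: if only one exporter reports a target as down, the problem is more likely on the probing side.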

 h3. Main DNS servers 

 * Objective: ensure all 3 DNS servers are running and answering queries 
 * Implementation: 
 ** Collecting/alerting with prometheus on place12 
 ** blackbox on place12 
 ** blackbox on place11 
 * Targets 
 ** dns1.ungleich.ch 
 ** dns2.ungleich.ch 
 ** dns3.ungleich.ch 
 * Status: TBD 
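
 As an illustration of the objective above, the sketch below evaluates blackbox exporter output for the three DNS servers and lists the ones that failed. It relies on the @probe_success@ metric that the blackbox exporter returns for each probe; the function names and the way the metrics text is obtained are assumptions, not part of the actual setup.

```python
def probe_succeeded(metrics_text):
    """Return True if the metrics text contains `probe_success 1`."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        if name == "probe_success":
            return float(value) == 1.0
    # No probe_success sample at all is treated as a failure.
    return False


def failing_servers(results):
    """Return the sorted names of servers whose probe did not succeed.

    `results` maps a server name to the raw metrics text returned
    by the exporter for that server.
    """
    return sorted(name for name, text in results.items()
                  if not probe_succeeded(text))


if __name__ == "__main__":
    ok = "# TYPE probe_success gauge\nprobe_success 1\n"
    bad = "# TYPE probe_success gauge\nprobe_success 0\n"
    sample = {
        "dns1.ungleich.ch": ok,
        "dns2.ungleich.ch": bad,
        "dns3.ungleich.ch": ok,
    }
    print(failing_servers(sample))
```

 In production the same condition would normally be expressed as a prometheus alerting rule on @probe_success@ rather than evaluated in a script.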

 h3. External primary router 

 * Objective: find out whether a router is reachable via any path 
 * Implementation: 
 ** Collecting/alerting with prometheus on place12 
 ** blackbox on place12 
 ** blackbox on place11 
 * Status: TBD 


 

 h3. Test external monitoring 

 * Objective: find out whether the external monitoring is alive 
 * Implementation: 
 ** Collecting/alerting with prometheus on place10 
 * Targets 
 ** ipv6/emonitor1.place12/prometheus 
 ** ipv6/emonitor1.place12/blackbox 
 ** ipv6/emonitor1.place12/alertmanager 
 ** ipv6/vm1.place11/blackbox 
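
 A minimal sketch of such a liveness check is shown below. It assumes that the components listen on their default ports (9090 for prometheus, 9115 for the blackbox exporter, 9093 for alertmanager), that each answers on a @/-/healthy@ path, and that the hosts are reachable under the hypothetical @.example@ names used here. The real check would run as prometheus scrape jobs on place10, not as a script.

```python
from urllib.request import urlopen

# Hypothetical URLs derived from the target list; host suffix, ports
# and the /-/healthy path are assumptions.
HEALTH_ENDPOINTS = {
    "emonitor1.place12/prometheus": "http://emonitor1.place12.example:9090/-/healthy",
    "emonitor1.place12/blackbox": "http://emonitor1.place12.example:9115/-/healthy",
    "emonitor1.place12/alertmanager": "http://emonitor1.place12.example:9093/-/healthy",
    "vm1.place11/blackbox": "http://vm1.place11.example:9115/-/healthy",
}


def is_healthy(url, timeout=5.0, opener=urlopen):
    """Return True if the URL answers with HTTP 200 within the timeout.

    `opener` is injectable so the function can be exercised without
    network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def unhealthy(endpoints, opener=urlopen):
    """Return the sorted names of all endpoints that are not healthy."""
    return sorted(name for name, url in endpoints.items()
                  if not is_healthy(url, opener=opener))


if __name__ == "__main__":
    print(unhealthy(HEALTH_ENDPOINTS))
```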

 h3. Test per-place monitoring infrastructure (blackbox exporter, prometheus) 

 Each place should provide a blackbox exporter suitable for monitoring onsite targets.  
 We need to ensure that these blackbox exporters all function and that prometheus instances are up. 

 * Objective: find out whether the onsite monitoring is alive 
 * Implementation: 
 ** Collecting/alerting with prometheus on place12 
 * Targets 
 ** blackbox-exporter + prometheus/place5 
 ** blackbox-exporter + prometheus/place6 
 ** blackbox-exporter + prometheus/place10 


 h3. Internal router monitoring (TBD) 

 Monitor the internal routers of each place. 

 * Objective: find out whether the internal routers are alive 
 * Implementation: 
 ** Collecting/alerting with prometheus on ... 
 * Targets 
 ** ipv6/apu-router1.place6 (via place10/blackbox) 

 h3. Internal network device monitoring 

 * Objective: find out whether all production switches are alive 
 * Implementation: 
 ** Dedicated blackbox_exporter on a router or similar (needs to be secured) 
 * Targets 
 ** All Arista switches in each place 
 ** All Mikrotik switches in each place