Revision 2 (Nico Schottelius, 03/01/2019 03:38 PM) → Revision 3/9 (Nico Schottelius, 03/01/2019 05:14 PM)
h1. Uptime objectives
{{toc}}
h2. Power Supply
* What: Power supply to all systems
* Setup:
** Core systems are connected to UPSs that last between 7 and 30 minutes
** Virtualisation systems are not (yet) fully connected to UPS
* Expected outages
h2. Internal Network
* What: The connection between servers, routers and switches.
* Setup: All systems are connected twice internally, usually via fiber
* Expected outages
** Single switch outage: no outage, possibly brief packet loss (LACP link detection might take a few seconds)
** Double switch outage: full outage, manual replacement
* Uptime objectives
** 2019: >= 99.999%
** 2020: >= 99.9995%
** 2021: >= 99.9995%
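
To make these percentages more tangible, they can be converted into a yearly downtime budget. The following is a minimal sketch, not part of the original page: it assumes a 365-day year as the measurement window (the page does not define one), and the function name @downtime_budget_seconds@ is hypothetical.

```python
# Sketch: convert an availability objective (percent) into a yearly downtime budget.
# Assumption: a 365-day year; the page itself does not define the measurement window.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the maximum downtime per year, in seconds, for a given objective."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


# Internal network objectives as listed above.
objectives = {2019: 99.999, 2020: 99.9995, 2021: 99.9995}

for year, percent in objectives.items():
    budget = downtime_budget_seconds(percent)
    print(f"{year}: >= {percent}% allows about {budget / 60:.1f} minutes per year")
```

Under that assumption, 99.999% leaves roughly 5.3 minutes of downtime per year, and 99.9995% roughly half of that.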
h2. L2 external Network
* What: the network between the different locations
* Setup:
** Provided by local (electricity) companies.
** No additional active equipment / same as internal network
* Expected outages
** 1 outage in 2018, which could be bridged by Wi-Fi
** If an outage happens, it lasts long (typical cause: someone digging through the cable)
** However, this happens very rarely
** Mid term geo redundant lines planned
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.995%
** 2021: >= 99.995%
h2. L3 external Network
* What: the external (uplink) networks
* Setup
** Currently one uplink by EDIG / HIAG
* Expected outages
** Based on 2018
** Provider was partially unresponsive / unwilling to cooperate
** Multiple smaller, one bigger outage
** 2nd and 3rd line providers are evaluated / phased in
** Plan for 2019: phase in 1 connection per DC plus a third at the hub
* Uptime objectives
** 2019: >= 99.9%
** 2020: >= 99.99%
** 2021: >= 99.995%
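
The network layers depend on each other, so an end-to-end figure is lower than any single objective. The sketch below illustrates this for the 2019 network objectives. It rests on assumptions that the page does not state: the layers are treated as strictly serial and their failures as independent. The function name @combined_availability@ is hypothetical.

```python
from math import prod

# 2019 objectives for the three network layers, as listed on this page.
objectives_2019 = {
    "internal network": 99.999,
    "L2 external network": 99.99,
    "L3 external network": 99.9,
}


def combined_availability(percentages):
    """Availability (percent) of layers in series, assuming independent failures."""
    return 100 * prod(p / 100 for p in percentages)


combined = combined_availability(objectives_2019.values())
print(f"Combined 2019 network objective: about {combined:.3f}%")
```

With these assumptions the combined value ends up slightly below the weakest layer (the L3 uplink at 99.9%), which shows that the uplink dominates the overall figure.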
h2. Routers
h2. Servers
* What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server
* Setup:
** Servers are dual power connected
** Servers are second-hand (used) hardware
** Servers are being monitored (Prometheus + Consul)
** Not yet sure how to detect servers that are about to fail
** So far 2 servers affected (out of about 20)
** Assuming 10% failure rate per