Revision 2 (Nico Schottelius, 03/01/2019 03:38 PM) → Revision 3/9 (Nico Schottelius, 03/01/2019 05:14 PM)

h1. Uptime objectives 

 {{toc}} 

 h2. Power Supply 

 * What: Power supply to all systems 
 * Setup: 
 ** Core systems are connected to UPSes that last between 7 and 30 minutes 
 ** Virtualisation systems are not (yet) fully connected to UPS 
 * Expected outages 




 h2. Internal Network 

 * What: The connection between servers, routers and switches. 
 * Setup: All systems have two internal connections (redundant links), usually via fiber 
 * Expected outages 
 ** Single switch outage: no outage, possibly brief packet loss (LACP link failure detection can take a few seconds) 
 ** Double switch outage: full outage, manual replacement 
 * Uptime objectives 
 ** 2019: >= 99.999% 
 ** 2020: >= 99.9995% 
 ** 2021: >= 99.9995% 
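
To make the percentages above more tangible, the following minimal sketch converts an availability objective into the maximum downtime it allows per year. It assumes a 365-day year; the function name @allowed_downtime_seconds@ is made up for this illustration.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the maximum downtime per year (in seconds) for an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


# Internal network objectives taken from the list above.
for year, target in [(2019, 99.999), (2020, 99.9995), (2021, 99.9995)]:
    seconds = allowed_downtime_seconds(target)
    print(f"{year}: >= {target}% allows about {seconds / 60:.1f} minutes per year")
```

The same function can be applied to the objectives of the other sections to compare their yearly downtime budgets.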



 h2. L2 external Network 

 * What: the network between the different locations 
 * Setup: 
 ** Provided by local (electricity) companies. 
 ** No additional active equipment / same as internal network 
 * Expected outages 
 ** One outage in 2018, which could be bridged via Wi-Fi 
 ** If an outage happens, it lasts long (e.g. a cable cut during digging work) 
 ** But it happens very rarely 
 ** Geo-redundant lines are planned for the mid term 
 * Uptime objectives 
 ** 2019: >= 99.99% 
 ** 2020: >= 99.995% 
 ** 2021: >= 99.995% 


 h2. L3 external Network 

 * What: the external (uplink) networks 
 * Setup 
 ** Currently one uplink, provided by EDIG / HIAG 
 * Expected outages 
 ** Based on 2018 
 ** Current uplink provider is partially unresponsive / unwilling to cooperate 
 ** Multiple smaller outages and one bigger outage 
 ** 2nd and 3rd line providers are being evaluated / phased in 
 ** Plan for 2019: phase in one connection per DC plus a third at the hub 
 * Uptime objectives 
 ** 2019: >= 99.9% 
 ** 2020: >= 99.99% 
 ** 2021: >= 99.995% 
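
As a rough illustration of how the network objectives interact, the sketch below multiplies the 2019 targets of the three network layers. It assumes that a service depends on all three layers at once and that their outages are independent. Neither assumption is stated on this page, so the result is only an estimate, and the helper name @series_availability@ is hypothetical.

```python
from math import prod


def series_availability(percent_targets):
    """Combined availability (in percent) of components that must all be up.

    Assumption: outages of the components are independent of each other.
    """
    return 100 * prod(p / 100 for p in percent_targets)


# 2019 objectives taken from the sections above.
targets_2019 = {
    "internal network": 99.999,
    "L2 external network": 99.99,
    "L3 external network": 99.9,
}

combined = series_availability(targets_2019.values())
print(f"Combined 2019 objective: {combined:.3f}%")
```

Under these assumptions the combined figure is always lower than the weakest single objective, so the L3 uplink dominates the overall result.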


 h2. Routers 



 h2. Servers 

 * What: Servers host VMs; if a server is defective, its VMs need to be restarted on a different server 
 * Setup: 
 ** Servers have dual power connections 
 ** Servers are used (second-hand) hardware 
 ** Servers are monitored (prometheus+consul) 
 ** Not yet sure how to detect servers that are about to fail 
 ** So far 2 servers have been affected (out of about 20) 
 ** Assuming 10% failure rate per