
h1. Uptime objectives 

 {{toc}} 

h2. Uptime definitions


|_. Uptime % |_. Max. downtime / year |
| 99 | 87.6h or 3.65 days |
| 99.9 | 8.76h |
| 99.99 | 0.876h or 52.56 minutes |
| 99.999 | 5.26 minutes |
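
The table follows directly from a plain 365-day year (8760 hours). A quick sketch to reproduce the numbers:

<pre><code class="python">
# Convert an uptime percentage into the allowed downtime per year,
# reproducing the table above (plain 365-day year, 8760 h).
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_per_year(uptime_percent: float) -> float:
    """Maximum downtime in hours/year for a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)

for pct in (99, 99.9, 99.99, 99.999):
    hours = downtime_per_year(pct)
    print(f"{pct}% -> {hours:.3f} h/year ({hours * 60:.2f} minutes)")
</code></pre>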

 h2. Power Supply 

 * What: Power supply to all systems 
 * Setup: 
** Core systems are connected to UPSes that last between 7 and 30 minutes
 ** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07) 
 * Uptime objective 
** Prior to full UPS installation: <= 24h downtime/year (within 99%)
 ** After UPS installation: 99.9% 
*** Actual downtime is likely lower, as most power outages last under one minute (easily bridged by the UPS)

 h2. L2 Internal Network 

 * What: The connection between servers, routers and switches. 
 * Setup: All systems are connected twice internally, usually via fiber 
 * Expected outages 
** Single switch outage: no service outage, possibly brief packet loss (LACP link detection can take a few seconds)
** Double switch outage: full outage; requires manual replacement
 * Uptime objectives 
 ** From 2019: >= 99.999% 
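
This objective is plausible because of the dual attachment: both switches must fail at the same time. A sketch of the redundancy math, assuming (purely for illustration) a single-switch availability of 99.9%:

<pre><code class="python">
# Availability of a redundant switch pair: the link is only down when
# BOTH switches are down at once (failures assumed independent).
# The single-switch availability is an assumed figure, not measured.
single = 0.999                  # assumed availability of one switch
pair = 1 - (1 - single) ** 2    # 1 - P(both down simultaneously)
print(f"pair availability: {pair:.6%}")               # 99.999900%
print(f"downtime/year: {(1 - pair) * 8760 * 60:.1f} minutes")
</code></pre>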

h2. L2 External Network

 * What: the network between the different locations 
 * Setup: 
 ** Provided by local (electricity) companies. 
 ** No additional active equipment / same as internal network 
 * Expected outages 
** One outage in 2018, which could be bridged via WiFi
** If an outage happens, it is long (typically someone digging through the cable)
** But outages happen very rarely
** Mid-term, geo-redundant lines are planned
** Geo-redundancy might be achieved starting in 2020
 * Uptime objectives 
 ** 2019: >= 99.99% 
 ** 2020: >= 99.999% 
 ** 2021: >= 99.999% 


h2. L3 External Network

 * What: the external (uplink) networks 
 * Setup 
 ** Currently 2 uplinks 
** Soon: two individual uplinks plus a third, central uplink
 * Expected outages 
** BGP support was added in 2019
** Outage simulations are still due
 * Uptime objectives 
 ** 2019: >= 99.99% 
 ** 2020: >= 99.999% 
 ** 2021: >= 99.999% 
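
With BGP on the uplinks, worst-case detection of a dead uplink is bounded by the hold timer. A rough budget check, assuming (hypothetically) the 90s hold time suggested by RFC 4271 and no faster detection such as BFD:

<pre><code class="python">
# How many worst-case BGP failovers per year fit into a 99.99% budget?
# The 90 s hold time is an assumed value (RFC 4271's suggested default);
# the real timers and any BFD tuning change these numbers.
budget_s = 8760 * 3600 * (1 - 0.9999)   # 99.99% -> 3153.6 s/year
hold_time_s = 90                        # worst-case failure detection
print(f"downtime budget: {budget_s:.0f} s/year")
print(f"worst-case failovers that fit: {budget_s / hold_time_s:.0f}/year")
</code></pre>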


 h2. Routers 

 * What: the central routers 
 * Setup 
 ** Two routers running Linux with keepalived 
** Both routers are rebooted periodically -> an outage of the remaining router during such a reboot would be critical, but is unlikely
 ** Routers are connected to UPS 
** Routers are running RAID1
 * Expected outages 
 ** Machines are rather reliable 
** If one machine has to be replaced, the replacement can be prepared while the other router is active
** Such events are rare; since 2017 there has been no router-related downtime
 * Uptime objectives 
 ** 2019: >= 99.99% 
 ** 2020: >= 99.999% 
 ** 2021: >= 99.999% 
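
Failover between the two keepalived routers is bounded by the VRRP master-down interval. A sketch of that arithmetic, assuming the protocol-default 1s advertisement interval and a backup priority of 100 (both assumptions, not values from the actual config):

<pre><code class="python">
# VRRP master-down interval (RFC 3768): 3 * advertisement interval
# plus a priority-dependent skew time. All values below are assumed
# defaults, not taken from the real keepalived configuration.
advert_int = 1.0               # seconds, VRRP default
priority = 100                 # assumed backup-router priority
skew = (256 - priority) / 256  # RFC 3768 skew time
failover = 3 * advert_int + skew
print(f"worst-case failover: {failover:.2f} s")  # ~3.61 s

# Even a monthly failover stays far inside the 99.99% budget:
budget_s = 8760 * 3600 * (1 - 0.9999)
print(f"12 failovers/year: {12 * failover:.0f} s of {budget_s:.0f} s budget")
</code></pre>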

 h2. VMs on servers 

* What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server
 * Setup: 
 ** Servers are dual power connected 
 ** Servers are used hardware 
** Servers are being monitored (Prometheus + Consul)
** It is not yet clear how to detect servers that are about to fail
 ** So far 3 servers affected (out of about 30) 
** Restarting a VM takes a couple of seconds, as its data is distributed in Ceph
** Detection is not yet reliably automated -> needs to be finished in 2019 (a detection sketch follows after this list)
 * Expected outages 
 ** At the moment servers "run until they die" 
 ** In the future servers should be periodically rebooted to detect broken hardware (live migrations enable this) 
** While a server downtime affects all of its VMs (up to 100 per server), it is a rare event
 * Uptime objectives (per VM) 
 ** 2019: >= 99.99% 
 ** 2020: >= 99.999% 
 ** 2021: >= 99.999% 
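
Until detection is fully automated, something along these lines could serve as the detection step: polling Consul's health API for checks in the "critical" state. A minimal sketch; the Consul address and the restart action are hypothetical placeholders, not the actual setup:

<pre><code class="python">
# Minimal sketch: list nodes with Consul health checks in "critical"
# state, so a dead server's VMs can be restarted elsewhere. Uses the
# standard Consul HTTP endpoint /v1/health/state/critical; the address
# and the restart action are placeholders.
import requests

CONSUL = "http://localhost:8500"  # placeholder address

def critical_nodes():
    resp = requests.get(f"{CONSUL}/v1/health/state/critical", timeout=5)
    resp.raise_for_status()
    return sorted({check["Node"] for check in resp.json()})

for node in critical_nodes():
    print(f"{node} has critical checks -> restart its VMs on another server")
</code></pre>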

 h2. Storage backends 

* What: the Ceph storage that contains the data of VMs and services
 * Setup 
** Each (VM) disk image is striped into 4MB blocks
 ** Each block is saved 3x 
 * Expected outages 
** Downtime only happens if 3 disks fail within a short time window of each other
** A single disk failure instantly triggers re-replication
** Disks (HDD, SSD) range from 600GB to 10TB
** The slowest rebuild speed is around 200MB/s
** Thus the slowest rebuild window (largest disk at slowest speed) is 14.56h (see the sketch after this list)
 * Uptime objectives (per image) 
 ** 2019: >= 99.999% 
 ** 2020: >= 99.999% 
 ** 2021: >= 99.999%
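
The 14.56h figure follows from re-replicating the largest disk at the slowest rebuild speed; a quick check, reading 10TB in binary units (10 * 1024 * 1024 MB) to match the number above:

<pre><code class="python">
# Worst-case rebuild window: the largest disk (10TB, binary units)
# re-replicated at the slowest observed rebuild speed of 200MB/s.
disk_mb = 10 * 1024 * 1024      # 10TB expressed in (binary) MB
rebuild_mb_per_s = 200          # slowest rebuild speed
hours = disk_mb / rebuild_mb_per_s / 3600
print(f"slowest rebuild window: {hours:.2f} h")   # 14.56 h
</code></pre>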