Revision 3 (Nico Schottelius, 03/01/2019 05:14 PM) → Revision 4/9 (Nico Schottelius, 07/01/2019 07:33 PM)
h1. Uptime objectives

{{toc}}

h2. Uptime definitions

|_. Uptime (%) |_. Downtime / year |
| 99 | 87.6h or 3.65 days |
| 99.9 | 8.76h |
| 99.99 | 0.876h or 52.56 minutes |
| 99.999 | 5.26 minutes |

h2. Power Supply

* What: Power supply to all systems
* Setup:
** Core systems are connected to UPS that last between 7 and 30 minutes
** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
* Uptime objective
** Prior to full UPS installation: <= 24h/year downtime (99%)
** After UPS installation: 99.9%
*** Probably less downtime, as most power outages are <1m

h2. Internal Network

* What: The connection between servers, routers and switches
* Setup: All systems are connected twice internally, usually via fiber
* Expected outages
** Single switch outage: no outage, maybe short packet loss (LACP link detection might take some seconds)
** Double switch outage: full outage, manual replacement
* Uptime objectives
** 2019: >= 99.999%
** 2020: >= 99.9995%
** 2021: >= 99.9995%

h2. L2 external Network

* What: The network between the different locations
* Setup:
** Provided by local (electricity) companies
** No additional active equipment / same as internal network
* Expected outages
** 1 outage in 2018, which could be bridged by Wi-Fi
** If an outage happens, it is long (someone digging through the cable)
** But it happens very rarely
** Mid-term: geo-redundant lines planned
** Geo redundancy might be achieved starting 2020
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. L3 external Network

* What: The external (uplink) networks
* Setup:
** Currently 2 uplinks
** Soon 2 individual uplinks plus a third central uplink by EDIG / HIAG
* Expected outages
** BGP support was added in 2019
** Outage simulations are still due
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. Routers

* What: The central routers
* Setup:
** Two routers running Linux with keepalived
** Both routers are rebooted periodically -> downtime during a reboot is critical, but unlikely
** Routers are connected to UPS
** Routers are running RAID 1
* Expected outages
** Machines are rather reliable
** If one machine has to be replaced, the replacement can be prepared while the other router is active
** Rare events: since 2017 no router-related downtime
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. Servers

* What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server
* Setup:
** Servers have dual power connections
** Servers are used (second-hand) hardware
** Servers are being monitored (Prometheus + Consul)
** Not yet clear how to detect servers that are about to fail
** So far 3 servers affected (out of about 30)
* Expected outages
** At the moment servers "run until they die"
** In the future servers should be periodically rebooted to detect broken hardware (live migrations enable this)
** While a server downtime affects all VMs on it (up to 100 per server), it is a rare event
* Uptime objectives (per VM)
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
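
h2. Worked example: downtime budgets

The downtime figures in the uptime definitions table follow from simple arithmetic, and the per-layer objectives can be combined in the same way. The following is a minimal sketch, not part of the original page. It assumes a 365-day year, and the function names @downtime_hours_per_year@ and @combined_uptime_percent@ are hypothetical. The combined figure additionally assumes that a VM needs every layer at once and that layers fail independently, which this page does not state. The layer values in the script are illustrative inputs only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, assuming a non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime (in hours) allowed by an uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


def combined_uptime_percent(layer_uptimes) -> float:
    """Return the uptime of a chain of layers that must all be up.

    Assumption (not from the page): layers fail independently,
    so their availabilities multiply.
    """
    availability = 1.0
    for percent in layer_uptimes:
        availability *= percent / 100.0
    return availability * 100.0


if __name__ == "__main__":
    # Reproduce the rows of the uptime definitions table.
    for percent in (99, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(percent)
        print(f"{percent}% -> {hours:.3f} h/year ({hours * 60:.2f} min/year)")

    # Illustrative inputs only: one example objective per layer.
    layers = {
        "power": 99.9,
        "internal network": 99.999,
        "L2 external network": 99.99,
        "L3 external network": 99.99,
        "servers": 99.99,
    }
    total = combined_uptime_percent(layers.values())
    print(f"combined: {total:.4f}% -> "
          f"{downtime_hours_per_year(total):.2f} h/year")
```

Under these assumptions the combined uptime is always lower than that of the weakest layer, so a per-VM objective cannot be tighter than the objectives of the layers it depends on.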