h1. Uptime objectives
{{toc}}
h2. Uptime definitions
| Availability (%) | Maximum downtime / year |
| 99 | 87.6h or 3.65 days |
| 99.9 | 8.76h |
| 99.99 | 0.876h or 52.56 minutes |
| 99.999 | 5.26 minutes |
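For reference, a minimal Python sketch (assuming a non-leap year of 8760 hours) that converts an availability percentage into the yearly downtime budget shown in the table above:

<pre><code class="python">
# Convert an availability percentage into the maximum downtime per year.
# Assumes a non-leap year of 365 days (8760 hours).

HOURS_PER_YEAR = 365 * 24  # 8760

def max_downtime(availability_percent: float) -> str:
    """Return the yearly downtime budget as a human-readable string."""
    hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return f"{hours:.2f}h" if hours >= 1 else f"{hours * 60:.2f} minutes"

for pct in (99, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {max_downtime(pct)}")
# Output: 99% -> 87.60h, 99.9% -> 8.76h, 99.99% -> 52.56 minutes, 99.999% -> 5.26 minutes
</code></pre>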
h2. Power Supply
* What: Power supply to all systems
* Setup:
** Core systems are connected to UPSes that last between 7 and 30 minutes
** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
* Uptime objective
** Prior to full UPS installation: 99% (expected downtime <= 24h/year)
** After UPS installation: 99.9%
*** Probably less, as most power outages last less than 1 minute
h2. L2 Internal Network
* What: The connection between servers, routers and switches.
* Setup: All systems are connected twice internally, usually via fiber
* Expected outages
** Single switch outage: no outage, possibly short packet loss (LACP link detection may take a few seconds; see the link-status sketch after this list)
** Double switch outage: full outage until manual replacement
* Uptime objectives
** From 2019: >= 99.999%
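A minimal link-status sketch in Python (assumptions: Linux bonding in 802.3ad/LACP mode and a bond interface named @bond0@; the wiki does not name the actual interfaces), counting how many member links of a bond are currently up:

<pre><code class="python">
# Count how many member links of a Linux bonding interface are up,
# by parsing the kernel's bonding status file. The bond name "bond0"
# is an assumption; adjust it to the actual interface name.

def bond_links_up(bond: str = "bond0") -> tuple[int, int]:
    """Return (links_up, links_total) for the given bonding interface."""
    up = total = 0
    in_slave = False
    with open(f"/proc/net/bonding/{bond}") as f:
        for line in f:
            if line.startswith("Slave Interface:"):
                in_slave = True
                total += 1
            elif in_slave and line.startswith("MII Status:"):
                if "up" in line:
                    up += 1
                in_slave = False
    return up, total

if __name__ == "__main__":
    up, total = bond_links_up()
    # 1/2 links up would correspond to the "single switch outage" case above
    print(f"{up}/{total} links up")
</code></pre>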
h2. L2 External Network
* What: the network between the different locations
* Setup:
** Provided by local (electricity) companies.
** No additional active equipment / same as internal network
* Expected outages
** 1 outage in 2018, which could be bridged via WiFi
** If an outage happens, it is long (e.g. a cable cut by digging work has to be physically repaired)
** However, such outages happen very rarely
** Mid-term, geo-redundant lines are planned
** Geo redundancy might be achieved starting 2020
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
h2. L3 External Network
* What: the external (uplink) networks
* Setup
** Currently 2 uplinks
** Soon: 2 individual uplinks plus a third central uplink (see the redundancy sketch after this list)
* Expected outages
** BGP support was added in 2019
** Outage simulations are still pending
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
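As a back-of-the-envelope illustration of why redundant uplinks support these objectives, a short Python sketch computing the combined availability of independent uplinks (the per-uplink figure of 99.9% is hypothetical, and real outages are not always independent):

<pre><code class="python">
# Combined availability of N redundant uplinks, assuming independent
# failures. The per-uplink availability of 99.9% is a hypothetical
# example, not a measured figure.

def combined_availability(per_link: float, n_links: int) -> float:
    """Availability when at least one of n independent links must be up."""
    return 1 - (1 - per_link) ** n_links

single = 0.999  # hypothetical availability of one uplink
for n in (1, 2):
    print(f"{n} uplink(s): {combined_availability(single, n):.6%}")
# 1 uplink(s): 99.900000%
# 2 uplink(s): 99.999900%
</code></pre>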
h2. Routers
* What: the central routers
* Setup
** Two routers running Linux with keepalived
** Both routers are rebooted periodically; an outage of the remaining router during such a reboot would be critical, but is unlikely
** Routers are connected to UPS
** Routers run RAID 1
* Expected outages
** Machines are rather reliable
** If one machine has to be replaced, the replacement can be prepared while the other router stays active
** Such events are rare; since 2017 there has been no router-related downtime
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
h2. VMs on servers
* What: Servers host VMs; in case of a defect, the VMs need to be restarted on a different server
* Setup:
** Servers are dual power connected
** Servers are second-hand (used) hardware
** Servers are being monitored (Prometheus + Consul)
** Not yet clear how to detect servers that are about to fail
** So far 3 servers affected (out of about 30)
** Restarting a VM takes a couple of seconds, as its data is distributed in Ceph
** Detection is not yet reliably automated -> needs to be finished in 2019 (see the detection sketch after this list)
* Expected outages
** At the moment servers "run until they die"
** In the future servers should be periodically rebooted to detect broken hardware (live migrations enable this)
** While a server downtime affects all VMs on it (up to 100 per server), it is a rare event
* Uptime objectives (per VM)
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
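A minimal sketch of how failing-server detection could be automated on top of Consul's standard HTTP health API (the Consul address and the restart step are placeholders; this is not the actual ungleich tooling):

<pre><code class="python">
# List nodes whose Consul health checks are in state "critical", as a
# starting point for restarting their VMs elsewhere. The Consul address
# and the restart step are placeholders, not the production tooling.
import requests

CONSUL = "http://localhost:8500"  # assumption: talk to a local Consul agent

def failing_nodes() -> set[str]:
    """Return the names of nodes with health checks in 'critical' state."""
    checks = requests.get(f"{CONSUL}/v1/health/state/critical", timeout=5).json()
    return {check["Node"] for check in checks}

if __name__ == "__main__":
    for node in sorted(failing_nodes()):
        # Placeholder: here the VMs of the failing node would be
        # restarted on a healthy server.
        print(f"{node} is failing, its VMs should be restarted elsewhere")
</code></pre>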
h2. Storage backends
* What: the ceph storage that contains the data of VMs and services
* Setup
** A (VM) disk image is striped into 4MB blocks
** Each block is stored 3 times (3 replicas)
* Expected outages
** Downtime only happens if 3 disks holding replicas of the same blocks fail within a short time window
** A single disk failure instantly triggers re-replication
** Disks (HDD, SSD) range from 600GB to 10TB
** A slow rebuild runs at around 200MB/s
** Thus the slowest (worst-case) rebuild window is about 14.56h (see the calculation sketch after this list)
* Uptime objectives (per image)
** 2019: >= 99.999%
** 2020: >= 99.999%
** 2021: >= 99.999%
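For reference, the worst-case rebuild window quoted above can be reproduced with a short calculation (assuming binary units, i.e. a 10 TiB disk rebuilt at a sustained 200 MiB/s; the wiki does not state the units explicitly):

<pre><code class="python">
# Reproduce the worst-case rebuild window quoted above.
# Assumption: binary units, i.e. 10 TiB rebuilt at a sustained 200 MiB/s.

DISK_BYTES = 10 * 2**40            # 10 TiB, the largest disk size in use
REBUILD_BYTES_PER_S = 200 * 2**20  # ~200 MiB/s slow rebuild speed

rebuild_hours = DISK_BYTES / REBUILD_BYTES_PER_S / 3600
print(f"Worst-case rebuild window: {rebuild_hours:.2f}h")  # 14.56h
</code></pre>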