Uptime objectives
Uptime definitions

| Availability % | Downtime / year          |
|----------------|--------------------------|
| 99             | 87.6h or 3.65 days       |
| 99.9           | 8.76h                    |
| 99.99          | 0.876h or 52.56 minutes  |
| 99.999         | 5.26 minutes             |
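These figures follow directly from the fraction of a year (8760h) that may be lost; a minimal sketch:

```python
# Derive the allowed yearly downtime from an availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8760h

def downtime_per_year(availability_percent: float) -> float:
    """Allowed downtime in hours per year."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for a in (99, 99.9, 99.99, 99.999):
    hours = downtime_per_year(a)
    print(f"{a}% -> {hours:.3f}h ({hours * 60:.2f} minutes)")
```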
Power Supply

- What: power supply to all systems
- Setup:
  - Core systems are connected to UPSes that last between 7 and 30 minutes
  - Virtualisation systems are not yet fully connected to UPS (to be finished 2019-07)
- Uptime objectives
  - Prior to full UPS installation: <= 24h/year of downtime (better than 99%)
  - After UPS installation: 99.9%
  - Likely better in practice, as most power outages last under one minute (see the sketch below)
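A back-of-the-envelope sketch of why bridging short outages matters so much: if most outages last under a minute, even the 7-minute worst-case UPS runtime absorbs them entirely. The outage durations below are illustrative assumptions, not measurements.

```python
# Illustrative only: yearly downtime with and without UPS, assuming a
# few short outages per year (assumed durations, not measurements).
outage_minutes = [0.5, 0.5, 1, 15]   # assumed outage durations
ups_bridge_minutes = 7               # worst-case UPS runtime from above

without_ups = sum(outage_minutes)
with_ups = sum(max(0, m - ups_bridge_minutes) for m in outage_minutes)
print(f"without UPS: {without_ups} min/year, with UPS: {with_ups} min/year")
```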
L2 Internal Network

- What: the connection between servers, routers and switches
- Setup: all systems are connected twice internally, usually via fiber
- Expected outages
  - Single switch outage: no outage, possibly brief packet loss (LACP link detection may take a few seconds; see the bond-status sketch below)
  - Double switch outage: full outage, manual replacement required
- Uptime objectives
  - From 2019: >= 99.999%
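If the dual links are realised as Linux 802.3ad (LACP) bonds, the state of each physical link can be read from /proc/net/bonding, which allows detecting a degraded (single-link) state before the second link also fails. A minimal sketch, assuming a Linux bonding setup; "bond0" is a hypothetical interface name:

```python
# Report the MII link state of each slave in a Linux LACP (802.3ad)
# bond by parsing /proc/net/bonding/<bond>.
from pathlib import Path

def bond_slave_states(bond: str = "bond0") -> dict:
    states = {}
    slave = None
    for line in Path(f"/proc/net/bonding/{bond}").read_text().splitlines():
        if line.startswith("Slave Interface:"):
            slave = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and slave:
            states[slave] = line.split(":", 1)[1].strip()
            slave = None
    return states

if __name__ == "__main__":
    print(bond_slave_states())  # e.g. {'eth0': 'up', 'eth1': 'down'}
```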
L2 External Network

- What: the network between the different locations
- Setup:
  - Provided by local (electricity) companies
  - No additional active equipment / same as the internal network
- Expected outages
  - 1 in 2018, which could be bridged by Wifi
  - If an outage happens, it is long (repair means digging up the cable), but it happens very rarely
- Mid-term, geo-redundant lines are planned
  - Geo redundancy might be achieved starting 2020 (see the availability sketch below)
- Uptime objectives
  - 2019: >= 99.99%
  - 2020: >= 99.999%
  - 2021: >= 99.999%
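The jump from 99.99% to 99.999% once geo-redundant lines exist follows from parallel availability: traffic is only lost when both lines are down at the same time. A minimal sketch, optimistically assuming independent failures:

```python
# Availability of two lines in parallel: the link is down only when
# both lines are down simultaneously (assuming independent failures).
def parallel(a1: float, a2: float) -> float:
    return 1 - (1 - a1) * (1 - a2)

single = 0.9999                           # one line at 99.99%
print(f"{parallel(single, single):.6%}")  # ~99.999999%
```

In practice the objective stays at 99.999%, since two lines into the same location are never fully independent (shared routes, shared digging).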
L3 External Network

- What: the external (uplink) networks
- Setup
  - Currently 2 uplinks
  - Soon 2 individual uplinks plus a third central uplink
- Expected outages
  - BGP support was added in 2019
  - Outage simulations are still due (see the probe sketch below)
- Uptime objectives
  - 2019: >= 99.99%
  - 2020: >= 99.999%
  - 2021: >= 99.999%
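A simple building block for the outstanding outage simulations is a per-uplink reachability probe: bind to each uplink's source address and attempt an outbound TCP connection while one uplink is taken down. This assumes source-based routing is in place so that each source address egresses via its own uplink; all names and addresses below are hypothetical placeholders.

```python
# Probe external reachability separately via each uplink by binding
# the socket to that uplink's source address (addresses hypothetical).
import socket

UPLINKS = {"uplink1": "192.0.2.10", "uplink2": "198.51.100.10"}
TARGET = ("example.com", 443)

for name, src in UPLINKS.items():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.bind((src, 0))
        s.connect(TARGET)
        print(f"{name}: reachable")
    except OSError as exc:
        print(f"{name}: failed ({exc})")
    finally:
        s.close()
```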
Routers

- What: the central routers
- Setup
  - Two routers running Linux with keepalived
  - Both routers are rebooted periodically -> downtime during that window is critical, but unlikely (see the notify sketch below)
  - Routers are connected to UPS
  - Routers run RAID1
- Expected outages
  - The machines are rather reliable
  - If one machine has to be replaced, the replacement can be prepared while the other router is active
  - Rare events: since 2017 there has been no router-related downtime
- Uptime objectives
  - 2019: >= 99.99%
  - 2020: >= 99.999%
  - 2021: >= 99.999%
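keepalived can call a notify script on every VRRP state transition, which makes it visible whether the periodic reboots ever overlap (i.e. both routers leave MASTER at once). keepalived passes three arguments: the type ("GROUP" or "INSTANCE"), the instance name and the new state ("MASTER", "BACKUP" or "FAULT"). A minimal logging sketch:

```python
#!/usr/bin/env python3
# Minimal keepalived notify script: log every VRRP state transition
# to syslog so overlapping failovers/reboots can be spotted later.
import sys
import syslog

vrrp_type, name, state = sys.argv[1:4]
syslog.openlog("vrrp-notify")
syslog.syslog(f"VRRP {vrrp_type} {name} -> {state}")
```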
VMs on servers

- What: servers host VMs; in case of a defect, VMs need to be restarted on a different server
- Setup:
  - Servers have dual power connections
  - Servers are used (second-hand) hardware
  - Servers are monitored (prometheus+consul)
    - Not yet sure how to detect soon-failing servers
    - So far 3 servers affected (out of about 30)
  - Restarting a VM takes a couple of seconds, as its data is distributed in ceph
  - Detection is not yet reliably automated -> needs to be finished in 2019 (see the query sketch below)
- Expected outages
  - At the moment servers "run until they die"
  - In the future servers should be rebooted periodically to detect broken hardware (live migration makes this possible)
  - While a server downtime affects all of its VMs (up to 100 per server), it is a rare event
- Uptime objectives (per VM)
  - 2019: >= 99.99%
  - 2020: >= 99.999%
  - 2021: >= 99.999%
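Until detection is fully automated, dead servers can at least be listed from the existing prometheus setup via its HTTP query API: targets whose `up` metric is 0 have stopped answering scrapes. A minimal sketch; the URL and job label are hypothetical:

```python
# List scrape targets that prometheus currently considers down
# (up == 0). URL and job label are hypothetical placeholders.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.com:9090"
query = urllib.parse.urlencode({"query": 'up{job="servers"} == 0'})

with urllib.request.urlopen(f"{PROMETHEUS}/api/v1/query?{query}") as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    print("down:", series["metric"].get("instance", "?"))
```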
Storage backends

- What: the ceph storage that contains the data of VMs and services
- Setup
  - Each disk image is striped into 4MB blocks
  - Each block is saved 3x
- Expected outages
  - Downtime happens only when 3 disks fail within a narrow time window
  - 1 disk failure triggers instant re-replication
  - Disks (HDD, SSD) range from 600GB to 10TB
  - Slow rebuild speed is around 200MB/s
  - Thus the slowest rebuild window is about 14.56h (see the calculation below)
- Uptime objectives (per image)
  - 2019: >= 99.999%
  - 2020: >= 99.999%
  - 2021: >= 99.999%
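The 14.56h worst case follows directly from the largest disk divided by the slowest rebuild speed, using binary units (10TiB at 200MiB/s):

```python
# Worst-case rebuild window: largest disk / slowest rebuild speed.
largest_disk_mib = 10 * 1024 * 1024   # 10 TiB in MiB
rebuild_mib_per_s = 200

seconds = largest_disk_mib / rebuild_mib_per_s
print(f"rebuild window: {seconds / 3600:.2f}h")  # ~14.56h
```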