Uptime objectives

Introduction

This is an internal (and public) planning document for our data center. It does not constitute an SLA (send us an email if you need SLAs). It is used for planning further stability improvements.

Uptime definitions

%	Downtime / year
99	87.6h or 3.65 days
99.9	8.76h
99.99	0.876h or 52.56 minutes
99.999	5.26 minutes
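
The figures in the table follow from multiplying the allowed unavailability by the hours in a year. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative only):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, assuming a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget in hours for an availability in percent."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Print the budget for each availability level used in this document.
for percent in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(percent)
    print(f"{percent}%: {hours:.3f} h = {hours * 60:.2f} min")
```

Using a 365.25-day year instead would shift the results slightly, but not the order of magnitude.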

Power Supply

What: Power supply to all systems
Setup:
- Core systems are connected to UPSes that last between 7 and 30 minutes
- Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
Expected outages
- Prior to full UPS installation: <= 24h/year (within the 99% budget of 87.6h/year)
- After UPS installation: 99.9%
  - Probably less downtime, as most power outages last less than 1 minute
- Values for 2020/2021 are estimates and need to be confirmed against actual power outages
Uptime objective
- 2019: 99%
- 2020: 99.9%
- 2021: 99.99%
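
To relate the estimated outage hours to the yearly objectives, the downtime can be converted back into an availability percentage. The following is a sketch under the assumption of a 365-day year; the helper names are hypothetical and only the 24h/year estimate and the objectives come from this page:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h, assuming a non-leap year


def availability_percent(downtime_hours: float) -> float:
    """Return the availability in percent for a given yearly downtime in hours."""
    return (1 - downtime_hours / HOURS_PER_YEAR) * 100


def meets_objective(downtime_hours: float, objective_percent: float) -> bool:
    """Check whether a yearly downtime stays within an availability objective."""
    return availability_percent(downtime_hours) >= objective_percent


worst_case = 24  # hours per year, the pre-UPS estimate from the list above
print(f"{availability_percent(worst_case):.3f}%")
print(meets_objective(worst_case, 99))    # 2019 objective
print(meets_objective(worst_case, 99.9))  # 2020 objective
```

A full 24 hours of outage would satisfy the 2019 objective but not the 2020 one, which is why the later objectives depend on the UPS installation being complete.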

L2 Internal Network

What: The connection between servers, routers and switches.
Setup: All systems are connected twice internally, usually via fiber
Expected outages
- Single switch outage: no outage, maybe short packet loss (LACP link detection might take some seconds)
- Double switch outage: full outage, manual replacement
Uptime objectives
- From 2019: >= 99.999%

L2 External Network

What: the network between the different locations
Setup:
- Provided by local (electricity) companies.
- No additional active equipment / same as internal network
Expected outages
- 1 outage in 2018, which could be bridged via Wi-Fi
- If an outage happens, it is long (typically someone digging through the cable)
- But it happens very rarely
- Geo-redundant lines are planned for the mid term
- Geo redundancy might be achieved starting in 2020
Uptime objectives
- 2019: >= 99.99%
- 2020: >= 99.999%
- 2021: >= 99.999%

L3 External Network

What: the external (uplink) networks
Setup
- Currently 2 uplinks
- Soon: 2 individual uplinks plus a third, central uplink
Expected outages
- BGP support was added in 2019
- Outage simulations are still outstanding
Uptime objectives
- 2019: >= 99.99%
- 2020: >= 99.999%
- 2021: >= 99.999%

Routers

What: the central routers
Setup
- Two routers running Linux with keepalived
- Both routers are rebooted periodically -> a failure of the remaining router during that window is critical, but unlikely
- Routers are connected to UPS
- Routers run on RAID 1
Expected outages
- Machines are rather reliable
- If one machine has to be replaced, the replacement can be prepared while the other router is active
- Rare events: since 2017 there has been no router-related downtime
Uptime objectives
- 2019: >= 99.99%
- 2020: >= 99.999%
- 2021: >= 99.999%
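
As a rough illustration of why a redundant router pair can target a much higher objective than a single machine, the sketch below applies the standard formula for parallel redundancy. It assumes independent failures and instant failover, which keepalived only approximates, and the per-router figure of 99.9% is a hypothetical value, not a measurement from this setup:

```python
def redundant_availability(single: float, count: int = 2) -> float:
    """Availability of `count` redundant units, assuming independent failures
    and that one working unit is enough (instant failover is assumed)."""
    return 1 - (1 - single) ** count


# Hypothetical figure: each router alone reaches 99.9% (not from this document).
single_router = 0.999
pair = redundant_availability(single_router)
print(f"pair availability: {pair * 100:.4f}%")
```

Correlated failures (shared power, a faulty configuration rolled out to both routers) are not covered by this model and would lower the real figure.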

VMs on servers

What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server
Setup:
- Servers have dual power connections
- Servers are used hardware
- Servers are monitored (Prometheus + Consul)
- Not yet sure how to detect servers that are about to fail
- So far 3 servers (out of about 30) have been affected by failures
- Restarting a VM takes a couple of seconds, as data is distributed in Ceph
- Detection is not yet reliably automated -> needs to be finished in 2019
Expected outages
- At the moment servers "run until they die"
- In the future servers should be periodically rebooted to detect broken hardware (live migrations enable this)
- While a server downtime affects all VMs (up to 100 per server), it's a rare event
Uptime objectives (per VM)
- 2019: >= 99.99%
- 2020: >= 99.999%
- 2021: >= 99.999%

Storage backends

What: the Ceph storage that contains the data of VMs and services
Setup
- A disk is striped into 4MB blocks
- Each block is saved 3x
Expected outages
- Downtime happens when 3 failures occur at the same time or within a short time window
- 1 disk failure triggers instant replication
- Disks (HDD, SSD) range from 600 GB to 10 TB
- A slow rebuild runs at around 200 MB/s
- Thus the slowest rebuild window is 14.56h (10 TB at 200 MB/s, counted in binary units)
Uptime objectives (per image)
- 2019: >= 99.999%
- 2020: >= 99.999%
- 2021: >= 99.999%
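
The 14.56h rebuild window can be reproduced by dividing the largest disk size by the slow rebuild speed. This sketch assumes the disk is completely full, the rebuild speed is constant, and the sizes are read as binary units (TiB, MiB), which is the reading that matches the stated figure:

```python
def rebuild_hours(disk_bytes: int, speed_bytes_per_s: int) -> float:
    """Time in hours to re-replicate a full disk at a constant rebuild speed."""
    return disk_bytes / speed_bytes_per_s / 3600


MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

# Largest disk and slow rebuild speed from the list above, in binary units.
slowest = rebuild_hours(10 * TIB, 200 * MIB)
print(f"slowest rebuild window: {slowest:.2f} h")

# Smallest disk (600 GB, read here as 600 GiB) at the same speed.
smallest = rebuild_hours(600 * GIB, 200 * MIB)
print(f"smallest disk: {smallest:.2f} h")
```

With decimal units (10 TB = 10^13 bytes, 200 MB/s = 2 * 10^8 bytes/s) the same calculation gives roughly 13.9h, so the stated value is the more conservative of the two.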

Updated by Nico Schottelius about 7 years ago · 9 revisions