h1. Uptime objectives
{{toc}}
h2. Introduction
This is an internal (and public) planning document for our data center. It does *not* constitute an SLA (send us an email if you need an SLA). It is used to plan further stability improvements.
h2. Uptime definitions
|_. % |_. Downtime / year |
| 99 | 87.6h or 3.65 days |
| 99.9 | 8.76h |
| 99.99 | 0.876h or 52.56 minutes |
| 99.999 | 5.26 minutes |
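
These values follow directly from the number of hours in a year; a minimal Python sketch to reproduce the table:

<pre><code class="python">
# Downtime budget per year for a given availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(availability_percent):
    """Allowed downtime per year (in hours) at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for a in (99, 99.9, 99.99, 99.999):
    print(f"{a}%: {downtime_hours(a):.2f}h")
# 99%: 87.60h, 99.9%: 8.76h, 99.99%: 0.88h, 99.999%: 0.09h (~5.26 min)
</code></pre>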
h2. Power Supply
* What: Power supply to all systems
* Setup:
** Core systems are connected to UPSes that last between 7 and 30 minutes
** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
* Expected outages
** Prior to full UPS installation: <= 24h/year (99%)
** After UPS installation: 99.9%
*** Likely better, as most power outages last under 1 minute and are bridged by the UPS (see the sketch below)
** Values for 2020/2021 are estimates and need to be confirmed against actual power outages
* Uptime objective
** 2019: 99%
** 2020: 99.9%
** 2021: 99.99%
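
A minimal sketch of the reasoning above: only outages that outlast the UPS count against the budget. The outage durations below are illustrative assumptions, not measurements.

<pre><code class="python">
# Estimate availability once the UPS bridges short outages.
HOURS_PER_YEAR = 365 * 24

ups_runtime_min = 7                      # worst-case UPS runtime (minutes)
outages_min = [0.5, 0.2, 45, 0.8, 1.5]   # hypothetical outage durations (minutes)

# Only outages longer than the UPS runtime cause actual downtime.
downtime_min = sum(o for o in outages_min if o > ups_runtime_min)
availability = 100 * (1 - downtime_min / (HOURS_PER_YEAR * 60))
print(f"downtime: {downtime_min} min/year -> availability: {availability:.3f}%")
# downtime: 45 min/year -> availability: 99.991%
</code></pre>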
h2. L2 Internal Network
* What: The connection between servers, routers and switches.
* Setup: All systems are connected twice internally, usually via fiber
* Expected outages
** Single switch outage: no outage, at most brief packet loss (LACP link failure detection may take a few seconds; see the sketch below)
** Double switch outage: full outage, requires manual replacement
* Uptime objectives
** From 2019: >= 99.999%
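
To put the LACP failover delay in perspective, a minimal sketch; the failure count and detection delay are assumed values. The 99.999% objective allows roughly 315 seconds of downtime per year:

<pre><code class="python">
# Annual downtime from LACP failovers vs. the 99.999% budget.
SECONDS_PER_YEAR = 365 * 24 * 3600

failovers_per_year = 4       # assumed single-switch failures per year
detection_delay_s = 3        # assumed LACP link-failure detection time

downtime_s = failovers_per_year * detection_delay_s
budget_s = SECONDS_PER_YEAR * (1 - 0.99999)
print(f"{downtime_s}s used of a {budget_s:.0f}s budget")  # 12s of 315s
</code></pre>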
h2. L2 External Network
* What: the network between the different locations
* Setup:
** Provided by local (electricity) companies
** No additional active equipment / same as the internal network
* Expected outages
** One outage in 2018, which could be bridged via WiFi
** If an outage happens, it is long (the cable has to be dug up and repaired)
** But outages happen very rarely (see the MTBF/MTTR sketch below)
** Mid-term, geo-redundant lines are planned
** Geo redundancy might be achieved starting 2020
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
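
Rare-but-long outages can be sanity-checked with the standard MTBF/MTTR availability formula; a minimal sketch with assumed values (one cable cut every five years, 48 hours to repair):

<pre><code class="python">
# Availability = MTBF / (MTBF + MTTR), with assumed values.
mtbf_h = 5 * 365 * 24   # assumed mean time between failures (hours)
mttr_h = 48             # assumed mean time to dig up and repair the cable

availability = 100 * mtbf_h / (mtbf_h + mttr_h)
print(f"{availability:.4f}%")
# ~99.8905% without bridging -> WiFi bridging / geo-redundant lines
# are what make the 99.999% objective reachable
</code></pre>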
h2. L3 External Network
* What: the external (uplink) networks
* Setup
** Currently 2 uplinks
** Soon 2 individual uplinks plus a third central uplink
* Expected outages
** BGP support was added in 2019 (enables failing over between uplinks; see the sketch below)
** Outage simulations are still due
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
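
A minimal sketch of why independent uplinks help: the service is only down when all uplinks fail at once. The 99.9% single-uplink availability is an assumption, as is full failure independence:

<pre><code class="python">
# Combined availability of n independent links, each with availability a.
def combined(a, n):
    """The service is down only if all n links are down at once."""
    return 1 - (1 - a) ** n

a = 0.999  # assumed availability of a single uplink
for n in (1, 2, 3):
    down_min = (1 - combined(a, n)) * 365 * 24 * 60
    print(f"{n} uplink(s): {down_min:.4f} min downtime/year")
# 1 uplink(s): 525.6000, 2 uplink(s): 0.5256, 3 uplink(s): 0.0005
</code></pre>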
h2. Routers
* What: the central routers
* Setup
** Two routers running Linux with keepalived
** Both routers are rebooted periodically -> a failure of the second router during that window would be critical, but is unlikely (see the sketch below)
** Routers are connected to UPS
** Routers are running RAID 1
* Expected outages
** Machines are rather reliable
** If one machine has to be replaced, the replacement can be prepared while the other router is active
** Rare events; since 2017 there has been no router-related downtime
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
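
A minimal sketch of the reboot risk window; reboot frequency, reboot duration and the router failure rate are all assumed values. Redundancy is only lost while one router reboots, so a simultaneous failure of the second router is very unlikely:

<pre><code class="python">
# Probability that the standby router fails while the primary is rebooting.
HOURS_PER_YEAR = 365 * 24

reboots_per_year = 12          # assumed: one maintenance reboot per month
reboot_duration_h = 5 / 60     # assumed: 5 minutes per reboot
failures_per_year = 0.5        # assumed: one router failure every 2 years

# Fraction of the year spent without redundancy:
exposed_fraction = reboots_per_year * reboot_duration_h / HOURS_PER_YEAR
p_double_failure = failures_per_year * exposed_fraction
print(f"P(double failure) per year: {p_double_failure:.2e}")  # ~5.71e-05
</code></pre>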
h2. VMs on servers
* What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server
* Setup:
** Servers are dual power connected
** Servers are used hardware
** Servers are being monitored (Prometheus + Consul)
** Not yet sure how to detect servers that are about to fail
** So far 3 servers affected (out of about 30)
** Restarting a VM takes a couple of seconds, as its data is distributed in Ceph
** Detection is not yet reliably automated -> needs to be finished in 2019
* Expected outages
** At the moment servers "run until they die"
** In the future servers should be rebooted periodically to detect broken hardware (live migration enables this)
** While a server downtime affects all VMs on it (up to 100 per server), it is a rare event (see the sketch below)
* Uptime objectives (per VM)
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
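
A minimal sketch of the per-VM downtime budget, using the failure count from above (3 of ~30 servers per year); the 10-minute detection-plus-restart time is an assumption:

<pre><code class="python">
# Expected per-VM downtime from server failures.
HOURS_PER_YEAR = 365 * 24

server_failures_per_year = 3 / 30   # 3 of ~30 servers failed -> 10%/server/year
recovery_h = 10 / 60                # assumed: detection + restart in 10 minutes

downtime_h = server_failures_per_year * recovery_h
availability = 100 * (1 - downtime_h / HOURS_PER_YEAR)
print(f"{downtime_h * 60:.1f} min/year -> {availability:.5f}%")
# 1.0 min/year -> 99.99981%
</code></pre>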
h2. Storage backends
* What: the Ceph storage that contains the data of VMs and services
* Setup
** Each disk image is striped into 4MB blocks
** Each block is saved 3x
* Expected outages
** Downtime only happens when all 3 copies fail within a short time window
** A single disk failure triggers instant re-replication
** Disks (HDD, SSD) range from 600GB to 10TB
** The slowest rebuild speed is around 200MB/s
** Thus the slowest rebuild window is 14.56h (see the sketch below)
* Uptime objectives (per image)
** 2019: >= 99.999%
** 2020: >= 99.999%
** 2021: >= 99.999%
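
The 14.56h figure follows from dividing the largest disk by the slowest rebuild speed, using binary units (TiB, MiB/s); a minimal sketch:

<pre><code class="python">
# Worst-case rebuild window: largest disk divided by slowest rebuild speed.
disk_bytes = 10 * 1024**4     # 10 TiB, the largest disk
rebuild_bps = 200 * 1024**2   # 200 MiB/s rebuild speed

rebuild_h = disk_bytes / rebuild_bps / 3600
print(f"{rebuild_h:.2f}h")  # 14.56h -> the window during which further
                            # failures of the same blocks would cause downtime
</code></pre>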