h1. Uptime objectives
{{toc}}
h2. Introduction
This is an internal (and public) planning document for our data center. It does *not* constitute an SLA (send us an email if you need an SLA). It is used to plan further stability improvements.
h2. Uptime definitions
|_. % |_. Downtime / year |
| 99 | 87.6h or 3.65 days |
| 99.9 | 8.76h |
| 99.99 | 0.876h or 52.56 minutes |
| 99.999 | 5.26 minutes |
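
These values follow directly from the number of hours in a year; a minimal Python sketch to reproduce the table:

<pre><code class="python">
# Downtime budget per year for a given availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(availability_percent):
    """Allowed downtime per year (in hours) at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for a in (99, 99.9, 99.99, 99.999):
    print(f"{a}%: {downtime_hours(a):.2f}h")
# 99%: 87.60h, 99.9%: 8.76h, 99.99%: 0.88h, 99.999%: 0.09h (~5.26 min)
</code></pre>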
h2. Power Supply
* What: Power supply to all systems
* Setup:
** Core systems are connected to UPSes that last between 7 and 30 minutes
** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
* Expected outages
** Prior to full UPS installation: <= 24h/year (99%)
** After UPS installation: 99.9%
*** Likely better, as most power outages last under 1 minute and are bridged by the UPS (see the sketch below)
** Values for 2020/2021 are estimates and need to be confirmed against actual power outages
* Uptime objective
** 2019: 99%
** 2020: 99.9%
** 2021: 99.99%
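
A minimal sketch of the reasoning above: only outages that outlast the UPS count against the budget. The outage durations below are illustrative assumptions, not measurements.

<pre><code class="python">
# Estimate availability once the UPS bridges short outages.
HOURS_PER_YEAR = 365 * 24

ups_runtime_min = 7                      # worst-case UPS runtime (minutes)
outages_min = [0.5, 0.2, 45, 0.8, 1.5]   # hypothetical outage durations (minutes)

# Only outages longer than the UPS runtime cause actual downtime.
downtime_min = sum(o for o in outages_min if o > ups_runtime_min)
availability = 100 * (1 - downtime_min / (HOURS_PER_YEAR * 60))
print(f"downtime: {downtime_min} min/year -> availability: {availability:.3f}%")
# downtime: 45 min/year -> availability: 99.991%
</code></pre>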
h2. L2 Internal Network
* What: The connection between servers, routers and switches.
* Setup: All systems are connected twice internally, usually via fiber
* Expected outages
** Single switch outage: no outage, at most brief packet loss (LACP link failure detection may take a few seconds; see the sketch below)
** Double switch outage: full outage, requires manual replacement
* Uptime objectives
** From 2019: >= 99.999%
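
To put the LACP failover delay in perspective, a minimal sketch; the failure count and detection delay are assumed values. The 99.999% objective allows roughly 315 seconds of downtime per year:

<pre><code class="python">
# Annual downtime from LACP failovers vs. the 99.999% budget.
SECONDS_PER_YEAR = 365 * 24 * 3600

failovers_per_year = 4       # assumed single-switch failures per year
detection_delay_s = 3        # assumed LACP link-failure detection time

downtime_s = failovers_per_year * detection_delay_s
budget_s = SECONDS_PER_YEAR * (1 - 0.99999)
print(f"{downtime_s}s used of a {budget_s:.0f}s budget")  # 12s of 315s
</code></pre>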
h2. L2 External Network
* What: the network between the different locations
* Setup:
** Provided by local (electricity) companies
** No additional active equipment / same as the internal network
* Expected outages
** One outage in 2018, which could be bridged via WiFi
** If an outage happens, it is long (the cable has to be dug up and repaired)
** But outages happen very rarely (see the MTBF/MTTR sketch below)
** Mid-term, geo-redundant lines are planned
** Geo redundancy might be achieved starting 2020
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
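
Rare-but-long outages can be sanity-checked with the standard MTBF/MTTR availability formula; a minimal sketch with assumed values (one cable cut every five years, 48 hours to repair):

<pre><code class="python">
# Availability = MTBF / (MTBF + MTTR), with assumed values.
mtbf_h = 5 * 365 * 24   # assumed mean time between failures (hours)
mttr_h = 48             # assumed mean time to dig up and repair the cable

availability = 100 * mtbf_h / (mtbf_h + mttr_h)
print(f"{availability:.4f}%")
# ~99.8905% without bridging -> WiFi bridging / geo-redundant lines
# are what make the 99.999% objective reachable
</code></pre>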
h2. L3 External Network
* What: the external (uplink) networks
* Setup
** Currently 2 uplinks
** Soon 2 individual uplinks plus a third central uplink
* Expected outages
** BGP support was added in 2019 (enables failing over between uplinks; see the sketch below)
** Outage simulations are still due
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
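
A minimal sketch of why independent uplinks help: the service is only down when all uplinks fail at once. The 99.9% single-uplink availability is an assumption, as is full failure independence:

<pre><code class="python">
# Combined availability of n independent links, each with availability a.
def combined(a, n):
    """The service is down only if all n links are down at once."""
    return 1 - (1 - a) ** n

a = 0.999  # assumed availability of a single uplink
for n in (1, 2, 3):
    down_min = (1 - combined(a, n)) * 365 * 24 * 60
    print(f"{n} uplink(s): {down_min:.4f} min downtime/year")
# 1 uplink(s): 525.6000, 2 uplink(s): 0.5256, 3 uplink(s): 0.0005
</code></pre>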
h2. Routers
* What: the central routers
* Setup
** Two routers running Linux with keepalived
** Both routers are rebooted periodically -> a failure of the second router during that window would be critical, but is unlikely (see the sketch below)
** Routers are connected to UPS
** Routers are running RAID 1
* Expected outages
** Machines are rather reliable
** If one machine has to be replaced, the replacement can be prepared while the other router is active
** Rare events; since 2017 there has been no router-related downtime
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
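
A minimal sketch of the reboot risk window; reboot frequency, reboot duration and the router failure rate are all assumed values. Redundancy is only lost while one router reboots, so a simultaneous failure of the second router is very unlikely:

<pre><code class="python">
# Probability that the standby router fails while the primary is rebooting.
HOURS_PER_YEAR = 365 * 24

reboots_per_year = 12          # assumed: one maintenance reboot per month
reboot_duration_h = 5 / 60     # assumed: 5 minutes per reboot
failures_per_year = 0.5        # assumed: one router failure every 2 years

# Fraction of the year spent without redundancy:
exposed_fraction = reboots_per_year * reboot_duration_h / HOURS_PER_YEAR
p_double_failure = failures_per_year * exposed_fraction
print(f"P(double failure) per year: {p_double_failure:.2e}")  # ~5.71e-05
</code></pre>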
h2. VMs on servers
* What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server
* Setup:
** Servers are dual power connected
** Servers are used hardware
** Servers are being monitored (Prometheus + Consul)
** Not yet sure how to detect servers that are about to fail
** So far 3 servers affected (out of about 30)
** Restarting a VM takes a couple of seconds, as its data is distributed in Ceph
** Detection is not yet reliably automated -> needs to be finished in 2019
* Expected outages
** At the moment servers "run until they die"
** In the future servers should be rebooted periodically to detect broken hardware (live migration enables this)
** While a server downtime affects all VMs on it (up to 100 per server), it is a rare event (see the sketch below)
* Uptime objectives (per VM)
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
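
A minimal sketch of the per-VM downtime budget, using the failure count from above (3 of ~30 servers per year); the 10-minute detection-plus-restart time is an assumption:

<pre><code class="python">
# Expected per-VM downtime from server failures.
HOURS_PER_YEAR = 365 * 24

server_failures_per_year = 3 / 30   # 3 of ~30 servers failed -> 10%/server/year
recovery_h = 10 / 60                # assumed: detection + restart in 10 minutes

downtime_h = server_failures_per_year * recovery_h
availability = 100 * (1 - downtime_h / HOURS_PER_YEAR)
print(f"{downtime_h * 60:.1f} min/year -> {availability:.5f}%")
# 1.0 min/year -> 99.99981%
</code></pre>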
h2. Storage backends
* What: the Ceph storage that contains the data of VMs and services
* Setup
** Each disk image is striped into 4MB blocks
** Each block is saved 3x
* Expected outages
** Downtime only happens when all 3 copies fail within a short time window
** A single disk failure triggers instant re-replication
** Disks (HDD, SSD) range from 600GB to 10TB
** The slowest rebuild speed is around 200MB/s
** Thus the slowest rebuild window is 14.56h (see the sketch below)
* Uptime objectives (per image)
** 2019: >= 99.999%
** 2020: >= 99.999%
** 2021: >= 99.999%
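
The 14.56h figure follows from dividing the largest disk by the slowest rebuild speed, using binary units (TiB, MiB/s); a minimal sketch:

<pre><code class="python">
# Worst-case rebuild window: largest disk divided by slowest rebuild speed.
disk_bytes = 10 * 1024**4     # 10 TiB, the largest disk
rebuild_bps = 200 * 1024**2   # 200 MiB/s rebuild speed

rebuild_h = disk_bytes / rebuild_bps / 3600
print(f"{rebuild_h:.2f}h")  # 14.56h -> the window during which further
                            # failures of the same blocks would cause downtime
</code></pre>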