h1. Uptime objectives

{{toc}}

h2. Uptime definitions

| % | Downtime / year |
| 99 | 87.6h or 3.65 days |
| 99.9 | 8.76h |
| 99.99 | 0.876h or 52.56 minutes |
| 99.999 | 5.26 minutes |
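
These downtime figures follow directly from a 365-day year (8760 hours). A minimal sketch of the conversion, for checking the table:

<pre><code class="python">
# Convert an availability percentage into the allowed downtime per year,
# assuming a 365-day year (8760 hours).
HOURS_PER_YEAR = 365 * 24

def downtime_per_year(availability_percent):
    hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return f"{hours:.2f}h" if hours >= 1 else f"{hours * 60:.2f} minutes"

for pct in (99, 99.9, 99.99, 99.999):
    print(pct, downtime_per_year(pct))
</code></pre>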

h2. Power Supply

* What: Power supply to all systems
* Setup:
** Core systems are connected to UPSes that last between 7 and 30 minutes
** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
* Expected outages
** Prior to full UPS installation: <= 24h/year, within the 99% objective (24h/8760h is about 0.3% downtime)
** After UPS installation: 99.9%
*** Probably less, as most power outages last under 1 minute
** Values for 2020/2021 are estimates and need to be confirmed against actual power outages (see the sketch below)
* Uptime objective
** 2019: 99%
** 2020: 99.9%
** 2021: 99.99%
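
Since the 2020/2021 values still need to be confirmed against recorded power outages, a minimal sketch of that check (the outage durations below are hypothetical examples):

<pre><code class="python">
# Compute the achieved yearly availability from recorded outage durations.
# The outage list is a hypothetical example; 8760 hours/year is assumed.
HOURS_PER_YEAR = 365 * 24

outages_minutes = [0.5, 12, 3]  # hypothetical recorded power outages
downtime_hours = sum(outages_minutes) / 60
availability = 100 * (1 - downtime_hours / HOURS_PER_YEAR)
print(f"{availability:.4f}%")  # -> 99.9971%
</code></pre>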

h2. L2 Internal Network

* What: The connection between servers, routers and switches.
* Setup: All systems are connected twice internally, usually via fiber
* Expected outages
** Single switch outage: no outage, possibly brief packet loss (LACP link detection may take a few seconds); see the sketch below for checking link state
** Double switch outage: full outage, manual replacement required
* Uptime objectives
** From 2019: >= 99.999%
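
A minimal sketch for monitoring the redundant links, assuming Linux bonding and a hypothetical interface name @bond0@ (the kernel exposes per-slave link state under /proc/net/bonding/):

<pre><code class="python">
# Print the MII link status of each slave in a Linux bond, so a dead
# switch or fiber shows up before the second path fails as well.
# "bond0" is a hypothetical interface name.
from pathlib import Path

def slave_status(bond="bond0"):
    slave = None
    for line in Path(f"/proc/net/bonding/{bond}").read_text().splitlines():
        if line.startswith("Slave Interface:"):
            slave = line.split(":", 1)[1].strip()
        elif slave and line.startswith("MII Status:"):
            print(slave, line.split(":", 1)[1].strip())
            slave = None

slave_status()
</code></pre>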

h2. L2 External Network

* What: the network between the different locations
* Setup:
** Provided by local (electricity) companies
** No additional active equipment / same as internal network
* Expected outages
** One outage in 2018, which could be bridged via WiFi
** If an outage happens, it is long (repairs involve digging up the cable)
** But such outages happen very rarely
** Mid-term, geo-redundant lines are planned
** Geo redundancy might be achieved starting 2020
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. L3 External Network

* What: the external (uplink) networks
* Setup
** Currently 2 uplinks
** Soon 2 individual uplinks plus a third, central uplink
* Expected outages
** BGP support was added in 2019
** Outage simulations are still pending
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. Routers

* What: the central routers
* Setup
** Two routers running Linux with keepalived
** Both routers are rebooted periodically -> downtime during that window would be critical, but is unlikely
** Routers are connected to UPS
** Routers run RAID1
* Expected outages
** Machines are rather reliable
** If one machine has to be replaced, the replacement can be prepared while the other router is active
** Rare events; since 2017 there has been no router-related downtime
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
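
With keepalived, failover means the virtual IP moves between the two routers. A minimal sketch for checking which role a router currently has, assuming a hypothetical virtual IP (keepalived adds the VIP to an interface on the MASTER):

<pre><code class="python">
# Check whether this router currently holds the keepalived virtual IP.
# 203.0.113.1 is a hypothetical placeholder VIP.
import subprocess

VIP = "203.0.113.1"

addrs = subprocess.run(["ip", "-o", "addr", "show"],
                       capture_output=True, text=True).stdout
print("MASTER (holds VIP)" if f"inet {VIP}/" in addrs else "BACKUP")
</code></pre>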

h2. VMs on servers

* What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server
* Setup:
** Servers are dual power connected
** Servers are used hardware
** Servers are monitored (Prometheus + Consul)
** Not yet sure how to detect servers that are about to fail
** So far 3 servers affected (out of about 30)
** Restart of a VM takes a couple of seconds, as its data is distributed in Ceph
** Detection is not yet reliably automated -> needs to be finished in 2019 (see the sketch below)
* Expected outages
** At the moment servers "run until they die"
** In the future servers should be rebooted periodically to detect broken hardware (live migrations enable this)
** While a server downtime affects all VMs (up to 100 per server), it is a rare event
* Uptime objectives (per VM)
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
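
Since detection is not yet reliably automated, a minimal sketch of the detection half, querying the Prometheus HTTP API for scrape targets that stopped reporting (the Prometheus URL is a hypothetical placeholder):

<pre><code class="python">
# List scrape targets that Prometheus currently sees as down (up == 0);
# these are the candidates for restarting their VMs elsewhere.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.com:9090"  # hypothetical placeholder

query = urllib.parse.urlencode({"query": "up == 0"})
with urllib.request.urlopen(f"{PROM}/api/v1/query?{query}") as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    print("down:", series["metric"].get("instance", "unknown"))
</code></pre>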

h2. Storage backends

* What: the Ceph storage that contains the data of VMs and services
* Setup
** A (virtual) disk is striped into 4MB blocks
** Each block is saved 3x
* Expected outages
** Downtime happens only if 3 disks fail within a short time window
** 1 disk failure triggers instant re-replication
** Disks (HDD, SSD) range from 600GB to 10TB
** Slow rebuild speed is around 200MB/s
** Thus the slowest rebuild window is 14.56h (see the sketch below)
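
The 14.56h figure follows from the largest disk at the slowest rebuild speed; binary units (TiB, MiB/s) are assumed here, which reproduces it. A minimal sketch of the arithmetic:

<pre><code class="python">
# Worst-case rebuild window: the largest disk (10 TiB) re-replicated at
# the slowest observed rebuild speed (200 MiB/s).
DISK_TIB = 10
REBUILD_MIB_PER_S = 200

seconds = DISK_TIB * 1024 * 1024 / REBUILD_MIB_PER_S  # MiB / (MiB/s)
print(f"{seconds / 3600:.2f}h")  # -> 14.56h
</code></pre>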

* Uptime objectives (per image)
** 2019: >= 99.999%
** 2020: >= 99.999%
** 2021: >= 99.999%