h1. Uptime objectives

{{toc}}

h2. Introduction

This is an internal (and public) planning document for our data center. It does *not* constitute an SLA (send us an email if you need SLAs). It is used for planning further stability improvements.

h2. Uptime definitions

|_. % |_. Downtime / year |
| 99 | 87.6h or 3.65 days |
| 99.9 | 8.76h |
| 99.99 | 0.876h or 52.56 minutes |
| 99.999 | 5.26 minutes |
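
These budgets follow directly from the 8,760 hours in a year. A minimal Python sketch (illustrative only, not part of our tooling) that reproduces the table:

<pre><code class="python">
# Downtime budget per year for a given availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8760 h

for availability in (99, 99.9, 99.99, 99.999):
    downtime_h = HOURS_PER_YEAR * (1 - availability / 100)
    print(f"{availability}%: {downtime_h:.3f} h/year "
          f"({downtime_h * 60:.2f} minutes, {downtime_h / 24:.2f} days)")
</code></pre>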

h2. Power Supply

* What: Power supply to all systems
* Setup:
** Core systems are connected to UPSes that last between 7 and 30 minutes
** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
* Expected outages
** Prior to full UPS installation: <= 24h/year (99%)
** After UPS installation: 99.9%
*** Probably less, as most power outages are shorter than 1 minute
** Values for 2020/2021 are estimates and need to be confirmed against actual power outages
* Uptime objectives
** 2019: 99%
** 2020: 99.9%
** 2021: 99.99%

h2. L2 Internal Network

* What: The connection between servers, routers and switches.
* Setup: All systems are connected twice internally, usually via fiber
* Expected outages
** Single switch outage: no outage, possibly short packet loss (LACP link detection may take a few seconds)
** Double switch outage: full outage, requires manual replacement
* Uptime objectives
** From 2019: >= 99.999%

h2. L2 external Network

* What: the network between the different locations
* Setup:
** Provided by local (electricity) companies.
** No additional active equipment / same as the internal network
* Expected outages
** One outage in 2018, which could be bridged via WiFi
** If an outage happens, it is a long one (e.g. a cable cut during construction work)
** However, such outages are very rare
** Mid-term, geo-redundant lines are planned
** Geo redundancy might be achieved starting 2020
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. L3 external Network

* What: the external (uplink) networks
* Setup
** Currently 2 uplinks
** Soon: 2 individual uplinks plus a third central uplink
* Expected outages
** BGP support was added in 2019
** Outage simulations are still pending
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. Routers

* What: the central routers
* Setup
** Two routers running Linux with keepalived (a sketch of a possible failover health check follows after this list)
** Both routers are rebooted periodically -> a downtime during that window would be critical, but is unlikely
** Routers are connected to UPS
** Routers run RAID 1
* Expected outages
** Machines are rather reliable
** If one machine has to be replaced, the replacement can be prepared while the other router is active
** Rare events; since 2017 there has been no router-related downtime
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
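
Keepalived moves the virtual IP between the two routers via VRRP and can lower a router's priority when a tracked health check fails. The following is a minimal, hypothetical check script (the gateway address is an assumption, not our actual configuration) that keepalived could run from a vrrp_script block; a non-zero exit status would trigger failover to the other router:

<pre><code class="python">
#!/usr/bin/env python3
# Hypothetical keepalived track script: exit 0 if the uplink gateway
# answers a ping, non-zero otherwise so keepalived lowers this router's
# priority and the peer takes over the virtual IP.
import subprocess
import sys

UPLINK_GATEWAY = "192.0.2.1"  # assumption: replace with the real next hop

def uplink_reachable(host: str) -> bool:
    # One ICMP echo request with a 2 second timeout.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    sys.exit(0 if uplink_reachable(UPLINK_GATEWAY) else 1)
</code></pre>

The exit status is all keepalived needs; the actual failover logic (VRRP advertisements, priorities) stays in keepalived itself.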

h2. VMs on servers

* What: Servers host VMs; in case of a hardware defect, VMs need to be restarted on a different server
* Setup:
** Servers have dual power connections
** Servers are used hardware
** Servers are being monitored (Prometheus + Consul)
** Not yet sure how to detect servers that are about to fail
** So far 3 servers affected (out of about 30)
** Restart of a VM takes a couple of seconds, as the data is distributed in Ceph
** Detection is not yet reliably automated -> needs to be finished in 2019 (see the sketch after this list for one possible approach)
* Expected outages
** At the moment servers "run until they die"
** In the future servers should be rebooted periodically to detect broken hardware (live migration makes this possible)
** While a server downtime affects all VMs (up to 100 per server), it is a rare event
* Uptime objectives (per VM)
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%
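
Since Prometheus already knows which targets stopped answering, automated detection of dead servers could be as small as polling its HTTP API for down targets. A minimal sketch, assuming a reachable Prometheus instance and the standard up metric (the URL and job label are placeholders, not our production values):

<pre><code class="python">
#!/usr/bin/env python3
# Ask Prometheus which monitored servers are currently down (up == 0).
# Purely illustrative: URL and job label are assumptions.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # assumption
QUERY = 'up{job="node"} == 0'                          # assumption: node exporter job

def down_servers() -> list[str]:
    url = (f"{PROMETHEUS_URL}/api/v1/query?"
           + urllib.parse.urlencode({"query": QUERY}))
    with urllib.request.urlopen(url, timeout=10) as response:
        data = json.load(response)
    # Each result carries the labels of a target that failed its scrape.
    return [r["metric"].get("instance", "?") for r in data["data"]["result"]]

if __name__ == "__main__":
    for instance in down_servers():
        print(f"server down, its VMs need to be restarted elsewhere: {instance}")
</code></pre>

This only covers detection; restarting the affected VMs would still go through the normal VM tooling.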

h2. Storage backends

* What: the Ceph storage that contains the data of VMs and services
* Setup
** Each disk image is striped into 4MB blocks
** Each block is stored 3 times
* Expected outages
** Downtime only happens when 3 failures occur within a near time window (i.e. before re-replication completes)
** 1 disk failure triggers instant re-replication
** Disks (HDD, SSD) range from 600GB to 10TB
** The slowest rebuild speed is around 200MB/s
** Thus the slowest rebuild window is 14.56h (see the calculation after this list)
* Uptime objectives (per image)
** 2019: >= 99.999%
** 2020: >= 99.999%
** 2021: >= 99.999%
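
The 14.56h figure is simply the largest disk divided by the slowest rebuild speed; it matches the binary reading of the units (10 TiB at 200 MiB/s). A small sketch of that arithmetic:

<pre><code class="python">
# Worst-case rebuild window: largest disk divided by slowest rebuild speed.
# The 14.56h in the list above corresponds to binary units (10 TiB at 200 MiB/s).
largest_disk_bytes = 10 * 1024**4          # 10 TiB
rebuild_speed_bytes_per_s = 200 * 1024**2  # 200 MiB/s

rebuild_seconds = largest_disk_bytes / rebuild_speed_bytes_per_s
print(f"rebuild window: {rebuild_seconds / 3600:.2f} h")  # -> 14.56 h
</code></pre>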