Uptime objectives » History » Version 4

Nico Schottelius, 07/01/2019 07:33 PM

h1. Uptime objectives

{{toc}}

h2. Uptime definitions

| % | Downtime / year |
| 99 | 87.6h or 3.65 days |
| 99.9 | 8.76h |
| 99.99 | 0.876h or 52.56 minutes |
| 99.999 | 5.26 minutes |

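The figures in the table follow from simple arithmetic. A minimal sketch, assuming a year of 365 days (8760 hours); the function name @downtime_hours_per_year@ is only an illustration and not part of any tooling described here:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h; assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the downtime in hours per year that an uptime percentage allows."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


for pct in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}%: {hours:.3f} h = {hours * 60:.2f} min")
```

Each additional "nine" divides the allowed downtime by ten.
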
h2. Power Supply

* What: Power supply to all systems
* Setup:
** Core systems are connected to UPS units that last between 7 and 30 minutes
** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
* Uptime objective
** Prior to full UPS installation: <= 24h downtime/year (about 99.7%, i.e. within the 99% class)
** After UPS installation: >= 99.9%
*** Probably less downtime than that, as most power outages last <1 minute

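To relate the downtime budget to a percentage, and to show why short outages should not count once the UPS is in place, here is a small sketch. It assumes a 365-day year; @uncovered_minutes@ is a hypothetical helper, and the 45-minute outage is an invented example value, not a measured one:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h; assumes a non-leap year


def availability_percent(downtime_hours: float) -> float:
    """Convert a yearly downtime budget in hours into an availability percentage."""
    return 100.0 * (1.0 - downtime_hours / HOURS_PER_YEAR)


def uncovered_minutes(outage_minutes: float, ups_runtime_minutes: float) -> float:
    """Minutes of a power outage that the UPS cannot bridge (hypothetical helper)."""
    return max(0.0, outage_minutes - ups_runtime_minutes)


print(f"24h/year -> {availability_percent(24):.2f}%")

# A 1-minute outage against the shortest UPS runtime named above (7 minutes)
print(uncovered_minutes(1, 7))

# An invented 45-minute outage against the longest UPS runtime named above (30 minutes)
print(uncovered_minutes(45, 30))
```

Only outages longer than the UPS runtime contribute to downtime, which is why the objective after the UPS installation can be tighter.
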
h2. Internal Network

* What: The connection between servers, routers and switches
* Setup: All systems are connected twice internally, usually via fiber
* Expected outages
** Single switch outage: no outage, possibly brief packet loss (LACP link detection might take some seconds)
** Double switch outage: full outage, manual replacement required
* Uptime objectives
** From 2019: >= 99.999%

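As a rough illustration of what the doubled connection buys, the sketch below estimates how available a single path has to be so that a redundant pair reaches the objective. It assumes that the two paths fail independently and that failover is instantaneous (the LACP detection time is ignored); neither assumption is stated on this page, and both function names are illustrative:

```python
import math


def pair_availability(single: float) -> float:
    """Availability of two redundant paths; the pair is down only if both are down.

    Assumes independent failures, which is a simplification.
    """
    return 1.0 - (1.0 - single) ** 2


def required_single_availability(target: float) -> float:
    """Availability each path needs so that the redundant pair meets the target."""
    return 1.0 - math.sqrt(1.0 - target)


target = 0.99999  # the 99.999% objective, as a fraction
single = required_single_availability(target)
print(f"each path needs about {single * 100:.3f}% to reach {target * 100:.3f}% as a pair")
```

Correlated failures (for example both switches sharing one power feed) would make the real figure worse than this estimate.
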
h2. L2 external Network

* What: The network between the different locations
* Setup:
** Provided by local (electricity) companies
** No additional active equipment / same as internal network
* Expected outages
** 1 outage in 2018, which could be bridged by Wifi
** If an outage happens, it is long (e.g. someone digging through the cable)
** But it happens very rarely
** Geo-redundant lines are planned for the mid term
** Geo redundancy might be achieved starting 2020
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. L3 external Network

* What: The external (uplink) networks
* Setup
** Currently 2 uplinks
** Soon 2 individual uplinks plus a third central uplink
* Expected outages
** BGP support was added in 2019
** Outage simulations are still due
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. Routers

* What: The central routers
* Setup
** Two routers running Linux with keepalived
** Both routers are rebooted periodically -> a failure of the remaining router during that time is critical, but unlikely
** Routers are connected to UPS
** Routers are running RAID1
* Expected outages
** Machines are rather reliable
** If one machine has to be replaced, the replacement can be prepared while the other router is active
** Rare events: since 2017 no router-related downtime
* Uptime objectives
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

h2. Servers

* What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server
* Setup:
** Servers are dual power connected
** Servers are used hardware
** Servers are being monitored (prometheus+consul)
** Not yet sure how to detect servers that are about to fail
** So far 3 servers affected (out of about 30)
* Expected outages
** At the moment servers "run until they die"
** In the future servers should be periodically rebooted to detect broken hardware (live migrations enable this)
** While a server downtime affects all of its VMs (up to 100 per server), it's a rare event
* Uptime objectives (per VM)
** 2019: >= 99.99%
** 2020: >= 99.999%
** 2021: >= 99.999%

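
Since a VM depends on every layer listed on this page, it can help to see what the individual objectives add up to. The sketch below multiplies the 2019 objectives, assuming the layers fail independently and that a VM is down whenever any one of them is down. This serial model is an assumption made for illustration, not something this page defines:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h; assumes a non-leap year

# 2019 objectives as listed in the sections above, in percent
OBJECTIVES_2019 = {
    "power supply (after UPS installation)": 99.9,
    "internal network": 99.999,
    "L2 external network": 99.99,
    "L3 external network": 99.99,
    "routers": 99.99,
    "servers (per VM)": 99.99,
}


def combined_availability_percent(objectives: dict) -> float:
    """Multiply the availabilities of layers that all have to be up at once."""
    combined = 1.0
    for percent in objectives.values():
        combined *= percent / 100.0
    return combined * 100.0


total = combined_availability_percent(OBJECTIVES_2019)
downtime_h = HOURS_PER_YEAR * (100.0 - total) / 100.0
print(f"combined: {total:.3f}% -> about {downtime_h:.1f} h downtime per year")
```

Under these assumptions the weakest layer dominates the result, so the power supply objective largely determines the combined figure for 2019.
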