Uptime objectives » History » Version 8
  Nico Schottelius, 07/01/2019 07:41 PM 
  
| 1 | 1 | Nico Schottelius | h1. Uptime objectives | 
|---|---|---|---|
| 2 | |||
| 3 | 2 | Nico Schottelius | {{toc}} | 
| 4 | 1 | Nico Schottelius | |
| 5 | 4 | Nico Schottelius | h2. Uptime definitions | 
| 6 | |||
| 7 | |||
| 8 | | Uptime (%) | Downtime / year | | ||
| 9 | | 99 | 87.6h or 3.65 days | | ||
| 10 | | 99.9 | 8.76h | | ||
| 11 | | 99.99 | 0.876h or 52.56 minutes | | ||
| 12 | | 99.999 | 5.26 minutes | | ||
| 13 | |||
| 14 | 1 | Nico Schottelius | h2. Power Supply | 
| 15 | |||
| 16 | * What: Power supply to all systems | ||
| 17 | * Setup: | ||
| 18 | ** Core systems are connected to UPSes that last between 7 and 30 minutes | ||
| 19 | 4 | Nico Schottelius | ** Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07) | 
| 20 | 8 | Nico Schottelius | * Expected outages | 
| 21 | 4 | Nico Schottelius | ** Prior to full UPS installation: <= 24h/year (within the 99% objective, which allows 87.6h/year) | 
| 22 | ** After UPS installation: 99.9% | ||
| 23 | 1 | Nico Schottelius | *** Probably less downtime, as most power outages last less than 1 minute | 
| 24 | 8 | Nico Schottelius | ** Values for 2020/2021 are estimated, need to be confirmed with actual power outages | 
| 25 | * Uptime objective | ||
| 26 | ** 2019: 99% | ||
| 27 | ** 2020: 99.9% | ||
| 28 | ** 2021: 99.99% | ||
| 29 | 1 | Nico Schottelius | |
| 30 | 5 | Nico Schottelius | h2. L2 Internal Network | 
| 31 | 1 | Nico Schottelius | |
| 32 | * What: The connection between servers, routers and switches. | ||
| 33 | * Setup: All systems are connected twice internally, usually via fiber | ||
| 34 | * Expected outages | ||
| 35 | ** Single switch outage: no outage, possibly brief packet loss (LACP link failure detection may take a few seconds) | ||
| 36 | ** Double switch outage: full outage, manual replacement | ||
| 37 | * Uptime objectives | ||
| 38 | ** From 2019: >= 99.999% | ||
| 39 | |||
| 40 | h2. L2 external Network | ||
| 41 | |||
| 42 | * What: the network between the different locations | ||
| 43 | * Setup: | ||
| 44 | ** Provided by local (electricity) companies. | ||
| 45 | ** No additional active equipment / same as internal network | ||
| 46 | * Expected outages | ||
| 47 | ** One outage in 2018, which could be bridged via Wi-Fi | ||
| 48 | ** If an outage happens, it lasts long (typically a cable cut during digging work) | ||
| 49 | ** But it happens very rarely | ||
| 50 | ** Geo-redundant lines are planned for the mid term | ||
| 51 | 4 | Nico Schottelius | ** Geo redundancy might be achieved starting 2020 | 
| 52 | 1 | Nico Schottelius | * Uptime objectives | 
| 53 | ** 2019: >= 99.99% | ||
| 54 | 4 | Nico Schottelius | ** 2020: >= 99.999% | 
| 55 | ** 2021: >= 99.999% | ||
| 56 | 1 | Nico Schottelius | |
| 57 | |||
| 58 | h2. L3 external Network | ||
| 59 | |||
| 60 | * What: the external (uplink) networks | ||
| 61 | * Setup | ||
| 62 | 4 | Nico Schottelius | ** Currently 2 uplinks | 
| 63 | ** Soon 2 individual plus a third central uplink | ||
| 64 | 1 | Nico Schottelius | * Expected outages | 
| 65 | 4 | Nico Schottelius | ** BGP support was added in 2019 | 
| 66 | ** Outage simulations are still pending | ||
| 67 | 1 | Nico Schottelius | * Uptime objectives | 
| 68 | 4 | Nico Schottelius | ** 2019: >= 99.99% | 
| 69 | ** 2020: >= 99.999% | ||
| 70 | ** 2021: >= 99.999% | ||
| 71 | 1 | Nico Schottelius | |
| 72 | |||
| 73 | h2. Routers | ||
| 74 | |||
| 75 | 4 | Nico Schottelius | * What: the central routers | 
| 76 | * Setup | ||
| 77 | ** Two routers running Linux with keepalived | ||
| 78 | ** Both routers are rebooted periodically -> a failure of the other router during a reboot is critical, but unlikely | ||
| 79 | ** Routers are connected to UPS | ||
| 80 | ** Routers are running RAID1 | ||
| 81 | * Expected outages | ||
| 82 | ** Machines are rather reliable | ||
| 83 | ** If one machine has to be replaced, the replacement can be prepared while the other router is active | ||
| 84 | ** Rare events: since 2017 there has been no router-related downtime | ||
| 85 | * Uptime objectives | ||
| 86 | ** 2019: >= 99.99% | ||
| 87 | ** 2020: >= 99.999% | ||
| 88 | ** 2021: >= 99.999% | ||
| 89 | 1 | Nico Schottelius | |
| 90 | 6 | Nico Schottelius | h2. VMs on servers | 
| 91 | 1 | Nico Schottelius | |
| 92 | * What: Servers host VMs; in case of a defect, VMs need to be restarted on a different server | ||
| 93 | * Setup: | ||
| 94 | ** Servers are dual power connected | ||
| 95 | ** Servers are used hardware | ||
| 96 | ** Servers are being monitored (Prometheus + Consul) | ||
| 97 | ** Not yet sure how to detect servers that are about to fail | ||
| 98 | ** So far 3 servers have been affected (out of about 30) | ||
| 99 | 6 | Nico Schottelius | ** Restart of a VM takes a couple of seconds, as data is distributed in Ceph | 
| 100 | ** Detection is not yet reliably automated -> needs to be finished in 2019 | ||
| 101 | 4 | Nico Schottelius | * Expected outages | 
| 102 | ** At the moment servers "run until they die" | ||
| 103 | ** In the future servers should be periodically rebooted to detect broken hardware (live migrations enable this) | ||
| 104 | ** While a server downtime affects all VMs (up to 100 per server), it's a rare event | ||
| 105 | * Uptime objectives (per VM) | ||
| 106 | ** 2019: >= 99.99% | ||
| 107 | ** 2020: >= 99.999% | ||
| 108 | ** 2021: >= 99.999% | ||
| 109 | 7 | Nico Schottelius | |
| 110 | h2. Storage backends | ||
| 111 | |||
| 112 | * What: the Ceph storage that contains the data of VMs and services | ||
| 113 | * Setup | ||
| 114 | ** A disk is striped into 4MB blocks | ||
| 115 | ** Each block is saved 3x | ||
| 116 | * Expected outages | ||
| 117 | ** Downtime happens only if 3 failures occur at the same time or within a short time window | ||
| 118 | ** 1 disk failure triggers instant replication | ||
| 119 | ** Disks (HDD, SSD) are ranging from 600GB to 10TB | ||
| 120 | ** The slowest rebuild speed is around 200MB/s | ||
| 121 | ** Thus the longest rebuild window is about 14.56h (10TB at 200MB/s, computed in binary units) | ||
| 122 | * Uptime objectives (per image) | ||
| 123 | ** 2019: >= 99.999% | ||
| 124 | ** 2020: >= 99.999% | ||
| 125 | ** 2021: >= 99.999% |
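
The arithmetic used on this page can be reproduced with a short script. The following is a minimal sketch, not part of the original setup: the function names `downtime_hours_per_year` and `rebuild_window_hours` are made up for illustration. It assumes a 365-day year (8760h) and that the disk sizes and rebuild speed in the storage section are binary units (TiB, MiB/s), which is the interpretation that yields the 14.56h figure.

```python
# Illustrative helpers (hypothetical names, not taken from any existing tooling).

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours

MIB = 1024 ** 2  # bytes per MiB
GIB = 1024 ** 3  # bytes per GiB
TIB = 1024 ** 4  # bytes per TiB


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime budget in hours for an uptime objective."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


def rebuild_window_hours(disk_bytes: int, rate_bytes_per_s: float) -> float:
    """Return the time in hours needed to re-replicate one full disk."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate_bytes_per_s must be positive")
    return disk_bytes / rate_bytes_per_s / 3600


# Downtime budgets for the objectives used on this page.
for percent in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(percent)
    print(f"{percent}% -> {hours:.3f}h/year ({hours * 60:.2f} minutes)")

# Rebuild window for the smallest and largest disks at the slow rebuild speed.
for label, size in (("600GB", 600 * GIB), ("10TB", 10 * TIB)):
    print(f"{label} at 200MB/s -> {rebuild_window_hours(size, 200 * MIB):.2f}h")
```

With decimal units (10 * 10^12 bytes at 200 * 10^6 bytes/s) the largest disk would take about 13.89h instead, so the unit assumption shifts the result by roughly 40 minutes.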