Uptime objectives

Uptime definitions

Uptime %   Downtime per year
99         87.6h (3.65 days)
99.9       8.76h
99.99      0.876h (52.56 minutes)
99.999     5.26 minutes
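
These budgets follow directly from a year of 8760 hours; a quick Python sketch of the calculation:

  # Downtime budget per year for a given availability percentage,
  # assuming a 365-day year (8760 hours).
  def downtime_per_year(availability_percent):
      return 8760 * (1 - availability_percent / 100)

  for a in (99, 99.9, 99.99, 99.999):
      h = downtime_per_year(a)
      print(f"{a}%: {h:.3f}h (~{h * 60:.1f} minutes)")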

Power Supply

  • What: Power supply to all systems
  • Setup:
    • Core systems are connected to UPSes that last between 7 and 30 minutes
    • Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
  • Uptime objective
    • Prior to full UPS installation: <= 24h of downtime per year (99% class)
    • After UPS installation: 99.9%
      • Probably less in practice, as most power outages last less than a minute

L2 Internal Network

  • What: The connection between servers, routers and switches.
  • Setup: All systems are connected twice internally, usually via fiber
  • Expected outages
    • Single switch outage: no service outage, possibly brief packet loss (LACP link failure detection may take a few seconds; see the bonding sketch below)
    • Double switch outage: full outage, manual replacement
  • Uptime objectives
    • From 2019: >= 99.999%
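
The actual production configuration is not reproduced here; purely as an illustration, a dual-link LACP (802.3ad) bond on a Debian-style system could look roughly like this (interface names and the address are assumptions):

  # Hypothetical /etc/network/interfaces fragment -- interface names and
  # addressing are assumptions, not the production configuration.
  auto bond0
  iface bond0 inet static
      address 10.0.0.2/24
      bond-slaves eth0 eth1
      bond-mode 802.3ad        # LACP
      bond-miimon 100          # link monitoring interval in ms
      bond-lacp-rate fast      # request LACPDUs every second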

L2 External Network

  • What: the network between the different locations
  • Setup:
    • Provided by local (electricity) companies.
    • No additional active equipment / same as internal network
  • Expected outages
    • One outage in 2018, which could be bridged via WiFi
    • If an outage happens, it tends to be long (e.g. a cable cut during digging work)
    • However, such outages are very rare
    • Mid-term, geo-redundant lines are planned
    • Geo redundancy might be achieved starting in 2020
  • Uptime objectives
    • 2019: >= 99.99%
    • 2020: >= 99.999%
    • 2021: >= 99.999%

L3 External Network

  • What: the external (uplink) networks
  • Setup
    • Currently 2 uplinks
    • Soon: 2 individual uplinks plus a third, central uplink
  • Expected outages
    • BGP support was added in 2019 (see the BGP sketch below)
    • Outage simulations are still pending
  • Uptime objectives
    • 2019: >= 99.99%
    • 2020: >= 99.999%
    • 2021: >= 99.999%
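
The uplink sessions are not documented in detail here; as a rough illustration only, a single BGP session declared in BIRD 2 might look like this (AS numbers and the neighbor address are placeholders, not the real values):

  # Hypothetical BIRD 2 BGP protocol block -- ASNs and neighbor IP are
  # placeholders, not the actual uplink configuration.
  protocol bgp upstream1 {
      local as 64512;
      neighbor 192.0.2.1 as 64511;
      ipv4 {
          import all;
          export where source = RTS_STATIC;   # only announce our own routes
      };
  }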

Routers

  • What: the central routers
  • Setup
    • Two routers running Linux with keepalived (see the keepalived sketch below)
    • Both routers are rebooted periodically -> an outage of the remaining router during that window would be critical, but is unlikely
    • Routers are connected to UPS
    • Routers run RAID1
  • Expected outages
    • Machines are rather reliable
    • If one machine has to be replaced, the replacement can be prepared while the other router is active
    • These are rare events; since 2017 there has been no router-related downtime
  • Uptime objectives
    • 2019: >= 99.99%
    • 2020: >= 99.999%
    • 2021: >= 99.999%
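
Failover between the two routers is handled by keepalived (VRRP); a minimal sketch of one instance, where the interface, virtual_router_id and address are assumed placeholders rather than the production values:

  # Hypothetical keepalived.conf fragment -- interface, VRID and the
  # virtual address are placeholders, not the production values.
  vrrp_instance uplink_v4 {
      state MASTER             # BACKUP on the second router
      interface bond0
      virtual_router_id 51
      priority 150             # lower on the second router
      advert_int 1
      virtual_ipaddress {
          10.0.0.1/24
      }
  }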

VMs on servers

  • What: Servers host VMs; in case of a defect, the VMs need to be restarted on a different server
  • Setup:
    • Servers are dual power connected
    • Servers are used hardware
    • Servers are monitored (Prometheus + Consul)
    • It is not yet clear how to detect servers that are about to fail
    • So far 3 servers affected (out of about 30)
    • Restart of a VM takes a couple of seconds, as data is distributed in ceph
    • Detection is not yet reliably automated -> needs to be finished in 2019 (see the alert rule sketch below)
  • Expected outages
    • At the moment servers "run until they die"
    • In the future servers should be periodically rebooted to detect broken hardware (live migrations enable this)
    • While a server downtime affects all VMs on it (up to 100 per server), it is a rare event
  • Uptime objectives (per VM)
    • 2019: >= 99.99%
    • 2020: >= 99.999%
    • 2021: >= 99.999%
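
Detection could, for example, be automated with a Prometheus alert on missing node metrics; a minimal sketch, where the job label and timing are assumptions rather than the deployed rules:

  # Hypothetical Prometheus alerting rule -- job label and "for" duration
  # are assumptions, not the deployed configuration.
  groups:
    - name: servers
      rules:
        - alert: ServerDown
          expr: up{job="node"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Server {{ $labels.instance }} is unreachable"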

Storage backends

  • What: the ceph storage that contains the data of VMs and services
  • Setup
    • Each (VM) disk is striped into 4 MB blocks
    • Each block is saved 3x
  • Expected outages
    • Downtime only occurs if 3 disks holding the same replicated blocks fail within a short time window
    • A single disk failure triggers immediate re-replication
    • Disks (HDD, SSD) range from 600 GB to 10 TB
    • A conservative (slow) rebuild speed is around 200 MB/s
    • Thus the slowest rebuild window is 14.56h (see the calculation below)
  • Uptime objectives (per image)
    • 2019: >= 99.999%
    • 2020: >= 99.999%
    • 2021: >= 99.999%
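
The 14.56h figure corresponds to the largest disk (10 TB, read in binary units) divided by the 200 MB/s rebuild speed; a quick Python check:

  # Worst-case rebuild window: largest disk divided by rebuild speed.
  # Binary units (TiB / MiB) are assumed, which matches the 14.56h above.
  disk_size_mib = 10 * 1024 * 1024       # 10 TiB expressed in MiB
  rebuild_speed_mib_s = 200              # ~200 MiB/s
  rebuild_hours = disk_size_mib / rebuild_speed_mib_s / 3600
  print(f"{rebuild_hours:.2f}h")         # -> 14.56h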
