Uptime objectives

Introduction

This is an internal (and public) planning document for our data center. It does not constitute an SLA (send us an email if you need SLAs). It is used for planning further stability improvements.

Uptime definitions

Uptime (%)   Downtime / year
99           87.6h or 3.65 days
99.9         8.76h
99.99        0.876h or 52.56 minutes
99.999       5.26 minutes
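
The figures in the table follow from multiplying the unavailable fraction by the number of hours in a year. A minimal sketch, assuming a 365-day year (8760 hours); the function name `downtime_per_year` is illustrative and not part of any tooling we use:

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def downtime_per_year(uptime_percent: float) -> float:
    """Return the allowed downtime in hours per year for an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


# Print the downtime budget for the uptime levels used in this document
for pct in (99, 99.9, 99.99, 99.999):
    hours = downtime_per_year(pct)
    print(f"{pct}%: {hours:.3f} h ({hours * 60:.2f} min)")
```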

Power Supply

  • What: Power supply to all systems
  • Setup:
    • Core systems are connected to UPSes that last between 7 and 30 minutes
    • Virtualisation systems are not (yet) fully connected to UPS (to be finished 2019-07)
  • Expected outages
    • Prior to full UPS installation: <= 24h/year (about 99.7%, which meets the 99% objective)
    • After UPS installation: 99.9%
      • Probably less downtime, as most power outages last less than 1 minute
    • Values for 2020/2021 are estimates and need to be confirmed against actual power outages
  • Uptime objective
    • 2019: 99%
    • 2020: 99.9%
    • 2021: 99.99%
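
Once real outage data is available, the estimates above could be checked along the following lines. This is only a sketch under stated assumptions: each outage is recorded as a duration in minutes, a UPS fully bridges an outage up to its runtime, and a system is down for the remainder of any longer outage. The function names and the sample durations are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def downtime_minutes(outages_min, ups_runtime_min=0.0):
    """Sum the minutes of each outage that exceed the UPS runtime."""
    return sum(max(0.0, outage - ups_runtime_min) for outage in outages_min)


def uptime_percent(outages_min, ups_runtime_min=0.0):
    """Yearly uptime in percent for a list of outage durations in minutes."""
    down = downtime_minutes(outages_min, ups_runtime_min)
    return 100.0 * (1 - down / MINUTES_PER_YEAR)


# Hypothetical outage durations in minutes, for illustration only
sample_outages = [0.5, 0.5, 2.0, 45.0]
print(uptime_percent(sample_outages))                     # without UPS
print(uptime_percent(sample_outages, ups_runtime_min=7))  # with the shortest UPS runtime (7 minutes)
```

Using the shortest UPS runtime gives the most pessimistic result for systems that are connected to a UPS.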

L2 Internal Network

  • What: The connection between servers, routers and switches.
  • Setup: All systems are connected twice internally, usually via fiber
  • Expected outages
    • Single switch outage: no outage, possibly brief packet loss (LACP link detection might take a few seconds)
    • Double switch outage: full outage until the switches are replaced manually
  • Uptime objectives
    • From 2019: >= 99.999%

L2 External Network

  • What: the network between the different locations
  • Setup:
    • Provided by local (electricity) companies.
    • No additional active equipment / same as internal network
  • Expected outages
    • One outage in 2018, which could be bridged via Wi-Fi
    • If an outage happens, it is long (usually someone has dug through the cable)
    • But it happens very rarely
    • Geo-redundant lines are planned for the mid term
    • Geo redundancy might be achieved starting in 2020
  • Uptime objectives
    • 2019: >= 99.99%
    • 2020: >= 99.999%
    • 2021: >= 99.999%

L3 External Network

  • What: the external (uplink) networks
  • Setup
    • Currently 2 uplinks
    • Soon 2 individual plus a third central uplink
  • Expected outages
    • BGP support was added in 2019
    • Outage simulations are still pending
  • Uptime objectives
    • 2019: >= 99.99%
    • 2020: >= 99.999%
    • 2021: >= 99.999%

Routers

  • What: the central routers
  • Setup
    • Two routers running Linux with keepalived
    • Both routers are rebooted periodically; a failure of the other router during a reboot would be critical, but is unlikely
    • Routers are connected to UPS
    • Routers are running RAID 1
  • Expected outages
    • Machines are rather reliable
    • If one machine has to be replaced, the replacement can be prepared while the other router is active
    • Rare events: since 2017 there has been no router-related downtime
  • Uptime objectives
    • 2019: >= 99.99%
    • 2020: >= 99.999%
    • 2021: >= 99.999%

VMs on servers

  • What: Servers host VMs; if a server fails, its VMs need to be restarted on a different server
  • Setup:
    • Servers are dual power connected
    • Servers are used hardware
    • Servers are monitored (Prometheus + Consul)
    • It is not yet clear how to detect servers that are about to fail
    • So far 3 servers have been affected (out of about 30)
    • Restarting a VM takes a couple of seconds, as the data is distributed in Ceph
    • Failure detection is not yet reliably automated; this needs to be finished in 2019
  • Expected outages
    • At the moment servers "run until they die"
    • In the future servers should be periodically rebooted to detect broken hardware (live migrations enable this)
    • While a server downtime affects all VMs on that server (up to 100 per server), it is a rare event
  • Uptime objectives (per VM)
    • 2019: >= 99.99%
    • 2020: >= 99.999%
    • 2021: >= 99.999%

Storage backends

  • What: the Ceph storage that contains the data of VMs and services
  • Setup
    • A disk is striped into 4 MB blocks
    • Each block is saved 3x
  • Expected outages
    • Downtime happens when 3 failures occur within the same short time window
    • 1 disk failure triggers instant replication
    • Disks (HDD, SSD) range from 600 GB to 10 TB
    • The slowest rebuild speed is around 200 MB/s
    • Thus the slowest rebuild window is about 14.56h (10 TB at 200 MB/s, counted in binary units)
  • Uptime objectives (per image)
    • 2019: >= 99.999%
    • 2020: >= 99.999%
    • 2021: >= 99.999%
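
The 14.56h rebuild window above can be reproduced as follows. A minimal sketch, assuming the sizes are meant in binary units (10 TiB, 200 MiB/s) and that a rebuild has to copy the full capacity of the largest disk at a constant speed; with decimal units the result would be about 13.9h. The function name `rebuild_hours` is illustrative.

```python
def rebuild_hours(disk_tib: float, speed_mib_per_s: float) -> float:
    """Hours needed to re-replicate a full disk at a constant speed."""
    if speed_mib_per_s <= 0:
        raise ValueError("speed must be positive")
    disk_mib = disk_tib * 1024 * 1024  # TiB -> MiB
    return disk_mib / speed_mib_per_s / 3600  # seconds -> hours


# Largest disk (10 TiB) at the slowest rebuild speed (200 MiB/s)
print(f"{rebuild_hours(10, 200):.2f} h")  # prints "14.56 h"
```

During this window the affected blocks are stored on fewer than 3 disks, which is why further failures in the same window are the critical case.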

Updated by Nico Schottelius 5 months ago · 9 revisions