Task #6919

closed

Define incident / downtime notification channels and reaction times

Added by Nico Schottelius over 4 years ago. Updated 3 months ago.

Status:
Rejected
Priority:
Normal
Target version:
-
Start date:
07/02/2019
Due date:
07/07/2019
% Done:

0%

Estimated time:
PM Check date:

Description

Request / Input

  • Proposal
  • Prerequisites

A trusted, dedicated channel for communicating service notifications
needs to be established. The channel should be a feed (RSS, Twitter,
etc.). A machine-readable format greatly benefits downstream
automation.

There are exactly two types of service notifications we expect to be
sent over this channel: (1) "Scheduled Maintenance" and (2) "Incident
Report".

The channel must not contain any other messages. (This keeps relaying
to third parties simple.)
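
The two-type restriction could be enforced in code before anything is
published or relayed. The following is a minimal sketch, not an agreed
design: the names `NotificationType` and `parse_type` are hypothetical,
and it assumes each message carries its type as a plain-text label.

```python
from enum import Enum


class NotificationType(Enum):
    """The only two message types allowed on the channel."""
    SCHEDULED_MAINTENANCE = "Scheduled Maintenance"
    INCIDENT_REPORT = "Incident Report"


def parse_type(label: str) -> NotificationType:
    """Map a label to a NotificationType.

    Raises ValueError for any label other than the two allowed ones,
    so unrelated messages never reach the channel.
    """
    return NotificationType(label.strip())
```

A relay to third parties would call `parse_type` on every outgoing
message and refuse to forward anything that raises `ValueError`.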

  • Case 1: Scheduled Maintenance

A "Scheduled Maintenance" notification announces planned work in
advance, such as moving servers or upgrades. Details should include,
at a minimum:

- A short description of the plans
- Planned starting time
- Planned ending time
- Expected downtime: yes/no
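
For the machine-readable variant, the four minimum fields above could
be modelled as a small record. This is only a sketch under assumptions:
the class name, the field names and the JSON layout are hypothetical
and not part of the proposal.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ScheduledMaintenance:
    """Minimum details of a scheduled maintenance notification."""
    description: str          # short description of the plans
    planned_start: datetime   # planned starting time
    planned_end: datetime     # planned ending time
    expected_downtime: bool   # expected downtime: yes/no

    def __post_init__(self) -> None:
        # Reject obviously inconsistent time windows.
        if self.planned_end <= self.planned_start:
            raise ValueError("planned_end must be after planned_start")

    def to_json(self) -> str:
        """Serialize to JSON with ISO 8601 timestamps (assumed format)."""
        return json.dumps({
            "type": "Scheduled Maintenance",
            "description": self.description,
            "planned_start": self.planned_start.isoformat(),
            "planned_end": self.planned_end.isoformat(),
            "expected_downtime": self.expected_downtime,
        })
```

Using timezone-aware timestamps (for example `tzinfo=timezone.utc`)
avoids ambiguity for downstream consumers in other time zones.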

  • Case 2: Incident Report

An incident report informs about degraded service or unexpected
downtime that ungleich experiences without warning. It also serves as
a notice of action, signaling that ungleich is aware of the issue and
is taking appropriate steps to resolve it.

When ungleich encounters problems with its infrastructure, ungleich
issues a first incident report via the dedicated channel. The report
should include, at a minimum:

- A very brief summary of the current state of information

It does not need to include a detailed analysis, planned mitigations
or an expected time frame.

When ungleich has analyzed the issue further, and it is foreseeable
that the problem will not be fixed within a time frame that is yet to
be defined (for example, 2h), ungleich sends another notification with
a short update that includes the new findings and information on when
the downtime is expected to end.

If the problem persists after another interval that is yet to be
defined (for example, 3h), ungleich sends another short notification
that updates the previous one, and continues to send updates at this
interval until the problem is resolved.
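
The reaction times described above can be sketched as a list of update
deadlines. This is a simplified illustration with assumptions: it
treats the first threshold as the latest time for the follow-up (the
text ties the follow-up to the outcome of the analysis, not to a fixed
clock time), it uses the example values of 2h and 3h as defaults, and
the function name `update_deadlines` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def update_deadlines(
    first_report: datetime,
    now: datetime,
    followup_after: timedelta = timedelta(hours=2),  # example value, to be defined
    repeat_every: timedelta = timedelta(hours=3),    # example value, to be defined
) -> List[datetime]:
    """Return the times up to `now` at which a notification was due
    for an incident that is still unresolved."""
    if repeat_every <= timedelta(0):
        raise ValueError("repeat_every must be positive")
    deadlines = [first_report]                 # first incident report
    next_due = first_report + followup_after   # follow-up with findings
    while next_due <= now:
        deadlines.append(next_due)
        next_due += repeat_every               # recurring short updates
    return deadlines
```

For an incident first reported at 10:00 and still open at 18:30, the
defaults yield deadlines at 10:00, 12:00, 15:00 and 18:00.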

Notes from Nico

  • Probably an external channel (i.e. Twitter-like) and a self-run channel (openness!)
#1

Updated by Nico Schottelius 3 months ago

  • Status changed from New to Rejected