Task #6919
Define incident / downtime notification channels and reaction times (closed)
0%
Description
Request / Input
- Proposal
- Prerequisites
A trusted, dedicated channel for communicating service notifications
needs to be established. The channel should be a feed (RSS, Twitter,
etc.). A machine-readable feed greatly benefits downstream
automation.
There are exactly two types of service notifications we expect to be
sent over this channel: (1) "Scheduled Maintenance" and (2) "Incident
Report".
The channel must not contain other messages. (This keeps a relay to
3rd parties simple.)
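A minimal sketch of how a downstream relay could enforce the "exactly two types" rule. The names `NotificationType` and `parse_type` are hypothetical and not part of any existing ungleich tooling; the sketch assumes each feed entry carries its type as a plain text label.

```python
from enum import Enum


class NotificationType(Enum):
    """The only two message types allowed on the channel."""
    SCHEDULED_MAINTENANCE = "Scheduled Maintenance"
    INCIDENT_REPORT = "Incident Report"


def parse_type(label: str) -> NotificationType:
    """Map a feed entry's label to a type; raise ValueError for anything else."""
    return NotificationType(label.strip())
```

A relay can then forward every entry that parses and treat a `ValueError` as a sign that the channel contains a message it should not.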
- Case 1: Scheduled Maintenance
A "Scheduled Maintenance" notification announces planned work in
advance, such as moving servers or upgrades. Details should include,
at a minimum:
- A short description of the plans
- Planned starting time
- Planned ending time
- Expected downtime: yes/no
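The four minimum fields above could be modelled roughly as follows. This is only a sketch: the class name, field names and the layout of the RSS `<item>` are assumptions, since the proposal does not fix a concrete format.

```python
from dataclasses import dataclass
from datetime import datetime
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class ScheduledMaintenance:
    description: str          # short description of the plans
    planned_start: datetime   # planned starting time
    planned_end: datetime     # planned ending time
    expected_downtime: bool   # expected downtime: yes/no

    def __post_init__(self):
        # Reject obviously inconsistent announcements early.
        if self.planned_end <= self.planned_start:
            raise ValueError("planned_end must be after planned_start")

    def to_rss_item(self) -> ET.Element:
        """Render the notification as an RSS <item> (layout is an assumption)."""
        item = ET.Element("item")
        ET.SubElement(item, "title").text = "Scheduled Maintenance"
        ET.SubElement(item, "category").text = "Scheduled Maintenance"
        downtime = "yes" if self.expected_downtime else "no"
        ET.SubElement(item, "description").text = (
            f"{self.description}\n"
            f"Planned start: {self.planned_start.isoformat()}\n"
            f"Planned end: {self.planned_end.isoformat()}\n"
            f"Expected downtime: {downtime}"
        )
        return item
```

Calling `ET.tostring(notice.to_rss_item(), encoding="unicode")` on an instance yields the XML text that would go into the feed.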
- Case 2: Incident Report
An incident report informs about degraded service or unexpected
downtime that ungleich experiences spontaneously. It also serves as a
notice of action, signaling that ungleich is aware of the issue and is
taking appropriate steps to resolve it.
When ungleich encounters problems with its infrastructure, it issues
a first incident report via the dedicated channel. The report should
include, at a minimum:
- A very brief summary of what is currently known
It does not need to include a detailed analysis, planned mitigations
or an expected time-frame.
When ungleich has analyzed the issue further, and it is foreseeable
that the problem will not be fixed within a to-be-defined time frame
(for example 2h), ungleich sends another notification with a short
update that includes the new findings and information on when the
downtime is expected to end.
If the problem persists after another to-be-defined time interval (for
example 3h), ungleich sends another short notification with an update
on the last notification and continues to send updates at this
interval.
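The reaction times described above can be sketched as a small schedule calculation. The 2h and 3h values are only the examples from the text, not agreed numbers, and the function name is hypothetical. The sketch also assumes one possible reading of the proposal: the first follow-up is due at the latest one time frame after the first report, and further updates follow at the repeat interval until the incident is resolved.

```python
from datetime import datetime, timedelta

# Example values taken from the proposal text; both are still to be defined.
FIRST_UPDATE_AFTER = timedelta(hours=2)
REPEAT_INTERVAL = timedelta(hours=3)


def update_deadlines(first_report: datetime,
                     resolved_at: datetime,
                     first_update_after: timedelta = FIRST_UPDATE_AFTER,
                     repeat_interval: timedelta = REPEAT_INTERVAL) -> list:
    """Return the times at which follow-up notifications are due.

    No follow-up is due if the incident is resolved before the first deadline.
    """
    deadlines = []
    due = first_report + first_update_after
    while due < resolved_at:
        deadlines.append(due)
        due += repeat_interval
    return deadlines
```

For an incident first reported at 10:00 and resolved at 19:00 on the same day, this yields follow-ups due at 12:00, 15:00 and 18:00.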
Notes from Nico
- Probably an external channel (e.g. Twitter-like) and a self-run channel (openness!)
Updated by Nico Schottelius about 1 year ago
- Status changed from New to Rejected