On-call

On-call is the cycle's last and first reader. They are awake when reality answers. Their job is to keep the timeline honest, to communicate to the people watching, and to hand the postmortem something it can actually use.

What good looks like

A competent on-call shift produces three artefacts in every cycle in which it is summoned:

  1. A timeline: minute by minute, written during the incident, in a form the postmortem can read without rewriting.
  2. A comms thread: what was told to whom and when. Engineering thread. Client thread if needed. No surprises in the morning.
  3. A runbook entry or change: every shift either uses a runbook (and notes the friction) or writes one. No two pages on the same scenario.

An on-call shift that produces these has the operation chain working. A shift that skips the timeline leaves the postmortem inventing history; one that skips the runbook update locks the knowledge in one person.
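One way to make "written during, not after" concrete is to stamp each timeline entry at the moment it is written. The sketch below is illustrative only: the `Timeline` class, its `note` and `render` methods, and the entry kinds (`observed`, `action`, `comms`) are hypothetical names, not part of any tool this page describes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Timeline:
    """Append-only incident timeline (hypothetical helper, not an existing tool)."""

    incident: str
    entries: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def note(self, kind: str, text: str, now: Optional[datetime] = None) -> None:
        # Stamp at write time (UTC) so the record shows when the note was taken.
        # `now` exists only so examples and tests can pass a fixed time.
        stamp = now if now is not None else datetime.now(timezone.utc)
        self.entries.append((stamp, kind, text))

    def render(self) -> str:
        # One line per entry, in the order written; nothing is edited afterwards.
        lines = [f"Timeline: {self.incident}"]
        for stamp, kind, text in self.entries:
            lines.append(f"{stamp:%H:%M}Z [{kind}] {text}")
        return "\n".join(lines)


# Usage with fixed timestamps; in a real shift `now` would be omitted.
timeline = Timeline("checkout errors")
timeline.note("observed", "5xx rate rising", now=datetime(2024, 1, 1, 2, 4, tzinfo=timezone.utc))
timeline.note("comms", "posted to engineering thread", now=datetime(2024, 1, 1, 2, 6, tzinfo=timezone.utc))
timeline.note("action", "followed rollback runbook", now=datetime(2024, 1, 1, 2, 11, tzinfo=timezone.utc))
print(timeline.render())
```

Keeping the log append-only and timestamped at write time is the design choice that matters here: the postmortem reads the record as it was taken, rather than a version reconstructed from memory the next morning.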

The on-call's stance

| On-call is responsible for | On-call is not responsible for |
| --- | --- |
| The timeline being written during, not after | Solving the root cause alone |
| The comms thread to engineering, leadership, client | Being the only person on the bridge |
| Following the runbook or noting why they didn't | Rewriting the architecture at 2am |
| The first 48-hour watch on every cycle's first release | The full release decision |
| Handing the postmortem material it can use | Owning the postmortem's fix |

On-call holds the chain by holding the moment the system was under stress and the record of what happened.

Three artefacts to read first

  1. Runbooks & Rollback
  2. The First 48 Hours
  3. Incidents & Postmortems

See also

200apps · How We Work · NWIRE