Incident Management

Detect → contain → communicate → resolve. Contain before you diagnose. Three roles — commander, communicator, investigator — that never combine, even on one-person on-call.

Owners: Tech Lead, On-call
Phase it lives in: After We Build (Volume V)
The corpus principle this enacts: Contain before diagnose. Escalation is information flow, not blame.

Where it lives in the chain

How to do this

When monitoring fires or an SLO breaches past tolerance:

  1. Detect — the page arrives.
  2. Contain — disable the feature flag. Minutes, not hours. Contain before you diagnose. Four levers in order: flag off → deploy rollback → migration rollback → data correction.
  3. Communicate — the pre-written template goes to the client if the incident exceeds 15 minutes. Not after resolution — during.
  4. Resolve — root cause identified, fix tested in staging, deployed through normal pipeline. Flag re-enabled only after staging confirms.
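
The containment step above can be sketched as a loop that pulls each lever in the stated order and stops at the first one that works. This is a minimal sketch, not the team's tooling: the `Lever` class, the `contain` and `client_update_due` functions, and the placeholder lambdas are all hypothetical names introduced for illustration. Only the lever order and the 15-minute threshold come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Lever:
    name: str
    apply: Callable[[], bool]  # hypothetical hook: returns True once the incident is contained


def contain(levers: Sequence[Lever]) -> Optional[str]:
    """Pull levers in the given order and stop at the first one that contains the incident."""
    for lever in levers:
        if lever.apply():
            return lever.name
    return None  # no lever contained the incident


def client_update_due(minutes_since_detection: float, threshold_minutes: float = 15) -> bool:
    """True once the incident has run longer than the communication threshold."""
    return minutes_since_detection > threshold_minutes


# Order taken from the text; the lambdas are placeholders for real tooling.
levers = [
    Lever("flag off", lambda: False),        # pretend the flag did not help
    Lever("deploy rollback", lambda: True),  # pretend the rollback contained it
    Lever("migration rollback", lambda: True),
    Lever("data correction", lambda: True),
]
contained_by = contain(levers)
```

Keeping the order in one list means the cheapest lever is always tried first, and the later, riskier levers are never touched once the incident is contained.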

The three roles

Even on a small team, three roles exist. On a one-person on-call, the hats switch in time — they are not held simultaneously.

  • Incident commander — decides what gets done next. Holds the timeline. Does not investigate.
  • Communicator — runs the status page, client comms, internal channel. Updates at least every 30 minutes.
  • Investigator — looks at the data. Reports findings to the commander. Does not communicate, does not decide.
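
One way to picture "the hats switch in time" on a one-person on-call is a small state holder that allows only the current role's actions and records every switch. This is an illustrative sketch under assumptions: `Role`, `ALLOWED_ACTIONS`, `SoloOnCall`, and `status_update_overdue` are hypothetical names, and the action labels are a simplification of the role descriptions above.

```python
from enum import Enum


class Role(Enum):
    COMMANDER = "commander"
    COMMUNICATOR = "communicator"
    INVESTIGATOR = "investigator"


# What each hat may do, simplified from the role descriptions (assumed labels).
ALLOWED_ACTIONS = {
    Role.COMMANDER: {"decide"},
    Role.COMMUNICATOR: {"communicate"},
    Role.INVESTIGATOR: {"investigate"},
}


class SoloOnCall:
    """One person, one hat at a time; every switch is recorded."""

    def __init__(self, start: Role = Role.COMMANDER) -> None:
        self.current = start
        self.history = [start]

    def switch(self, role: Role) -> None:
        self.current = role
        self.history.append(role)

    def act(self, action: str) -> str:
        # Refuse actions that belong to a different hat.
        if action not in ALLOWED_ACTIONS[self.current]:
            raise PermissionError(f"{self.current.value} does not {action}; switch hats first")
        return f"{self.current.value}: {action}"


def status_update_overdue(minutes_since_last_update: float, interval_minutes: float = 30) -> bool:
    """True when the communicator has gone longer than the update interval."""
    return minutes_since_last_update > interval_minutes
```

The explicit `switch` call is the point of the sketch: changing hats becomes a deliberate, logged act rather than something that happens implicitly mid-task.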

Escalation matrix by priority

Severity | Notify in 15 min | Notify in 30 min | Status page
P0 | Leadership + on-call | War room open; client within 1h | Automatic
P1 | Engineering manager + PM | Team assembled if unresolved | Manual
P2 | PM | Daily updates | Internal note
P3 | Team lead at next standup | |

The "no surprises" rule: leadership should never learn about a problem from a customer.
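
The escalation matrix can also be held as data, so that a script or bot can report which notifications are past their deadline. The sketch below is an assumption-laden illustration: `Escalation`, `ESCALATION`, and `overdue` are hypothetical names, and P3 is modelled as an untimed step because "at next standup" carries no minute deadline. That reading of the P3 row is an interpretation, not something the source states.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Escalation(NamedTuple):
    within_15_min: Optional[str]
    within_30_min: Optional[str]
    status_page: Optional[str]
    untimed: Optional[str] = None  # step with no minute deadline (assumed reading of P3)


ESCALATION = {
    "P0": Escalation("Leadership + on-call", "War room open; client within 1h", "Automatic"),
    "P1": Escalation("Engineering manager + PM", "Team assembled if unresolved", "Manual"),
    "P2": Escalation("PM", "Daily updates", "Internal note"),
    "P3": Escalation(None, None, None, untimed="Team lead at next standup"),
}


def overdue(severity: str, minutes_elapsed: float, done: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the timed steps whose deadline has passed and that are not marked done."""
    row = ESCALATION[severity]
    deadlines = [(15, row.within_15_min), (30, row.within_30_min)]
    return [
        step
        for deadline, step in deadlines
        if step is not None and minutes_elapsed > deadline and step not in done
    ]
```

For example, `overdue("P0", 20)` returns only the 15-minute step, while `overdue("P0", 31)` returns both timed steps if neither has been marked done.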

What good practice looks like

The JWT outage: 4 minutes to detect, 8 minutes to identify root cause, 5 minutes to revert. 44 minutes total, including the soak and verification. The speed was good. The prevention — the missing token-compatibility smoke test — was the gap. Incident response is rehearsed, not improvised; the runbook is what makes the rehearsal possible.
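
The timeline figures can be checked with simple arithmetic. The sketch assumes the three named phases ran back to back without overlap, and that the remaining time is the soak and verification the text mentions. The source does not break the remainder down further, and the variable names are illustrative.

```python
# Durations in minutes, as quoted in the text.
phases = {"detect": 4, "identify root cause": 8, "revert": 5}
total_minutes = 44

hands_on = sum(phases.values())                    # 4 + 8 + 5 = 17
soak_and_verification = total_minutes - hands_on   # 44 - 17 = 27
```

Under those assumptions, 17 of the 44 minutes went to detection, diagnosis, and the revert, and the other 27 to confirming the revert held.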

200apps · How We Work · NWIRE