Post-Release & Learning · master area
Incident Management
Detect → contain → communicate → resolve. Contain before you diagnose. Three roles — commander, communicator, investigator — that never combine, even on one-person on-call.
Owners: Tech Lead, On-call Phase it lives in: After We Build (Volume V) The corpus principle this enacts: Contain before diagnose. Escalation is information flow, not blame.
Where it lives in the chain
- After We Build · Incidents and Postmortems — the canon
How to do this
When monitoring fires or an SLO breaches past tolerance:
- Detect — the page arrives.
- Contain — disable the feature flag. Minutes, not hours. Contain before you diagnose. Four levers in order: flag off → deploy rollback → migration rollback → data correction.
- Communicate — the pre-written template goes to the client if the incident exceeds 15 minutes. Not after resolution — during.
- Resolve — root cause identified, fix tested in staging, deployed through normal pipeline. Flag re-enabled only after staging confirms.
The three roles
Even on a small team, three roles exist. On a one-person on-call, the hats switch in time — they are not held simultaneously.
- Incident commander — decides what gets done next. Holds the timeline. Does not investigate.
- Communicator — runs the status page, client comms, internal channel. Updates every 30 minutes minimum.
- Investigator — looks at the data. Reports findings to the commander. Does not communicate, does not decide.
Escalation matrix by priority
| Severity | Notify in 15 min | Notify in 30 min | Status page |
|---|---|---|---|
| P0 | Leadership + on-call | War room open; client within 1h | Automatic |
| P1 | Engineering manager + PM | Team assembled if unresolved | Manual |
| P2 | PM | Daily updates | Internal note |
| P3 | Team lead at next standup | — | — |
The "no surprises" rule: leadership should never learn about a problem from a customer.
What good practice looks like
The JWT outage: 4 minutes to detect, 8 minutes to identify root cause, 5 minutes to revert. 44 minutes total, including the soak and verification. The speed was good. The prevention — the missing token-compatibility smoke test — was the gap. Incident response is rehearsed, not improvised; the runbook is what makes the rehearsal possible.
Related crafts
- Escalation
- Runbooks
- Rollback Discipline
- Postmortem — what follows resolution