
Incidents and Postmortems

Contain before diagnose. Postmortems that produce structural changes, not feelings.

Events in this phase

An incident begins when a runbook condition fires or an SLO is breached past tolerance. Containment is the first action; diagnosis comes after. A postmortem is scheduled within five working days, regardless of severity. Action items are owned, dated, and tracked alongside stories.

An incident is what happens when reality goes more wrong than the chain anticipated. The point of incident discipline is not heroism. It is to make the same incident less likely the next cycle, by trading the slow expensive way of finding gaps (live customers) for the fast cheap way (chain artifacts).

Contain before diagnose

The first action in an incident is not figuring out why. It is to stop the harm from growing. Four levers, in order of preference:

  1. Flag off — the new behavior is wrapped, the switch is one click. No deploy. No code.
  2. Roll deploy back — last known good. CI keeps the artifact for exactly this reason.
  3. Migration rollback — only if the migration was authored to be reversible. If it wasn't, the incident is bigger than this incident.
  4. Data correction — last resort. Always after the bleeding has stopped, never during.

Diagnosis happens after containment. Trying to diagnose during containment slows containment. The runbook tells the on-call which lever to pull and which to skip.
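The order of preference above can be sketched as a small decision function. This is a minimal illustration, not a real runbook: `IncidentContext`, its fields, and `next_lever` are hypothetical names, and an actual runbook would carry its own conditions for each lever.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Lever(Enum):
    FLAG_OFF = "flag off"
    ROLL_DEPLOY_BACK = "roll deploy back"
    MIGRATION_ROLLBACK = "migration rollback"
    DATA_CORRECTION = "data correction"


@dataclass
class IncidentContext:
    # Hypothetical fields; a real runbook would define its own conditions.
    behind_flag: bool           # the new behavior is wrapped in a feature flag
    last_good_artifact: bool    # CI still holds the last known good build
    migration_applied: bool     # a migration shipped with the change
    migration_reversible: bool  # that migration was authored to be reversible
    bleeding_stopped: bool      # the harm is no longer growing


def next_lever(ctx: IncidentContext) -> Optional[Lever]:
    """Return the most preferred lever that applies, or None if none does."""
    if ctx.behind_flag:
        return Lever.FLAG_OFF
    if ctx.last_good_artifact:
        return Lever.ROLL_DEPLOY_BACK
    if ctx.migration_applied and ctx.migration_reversible:
        return Lever.MIGRATION_ROLLBACK
    # Data correction is the last resort, and only once the bleeding has stopped.
    if ctx.bleeding_stopped:
        return Lever.DATA_CORRECTION
    return None  # no lever applies: the incident is bigger than this runbook
```

A `None` result is deliberately distinct from any lever: it signals that the on-call should escalate rather than improvise.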

Roles during the incident

Three roles, even on a small team. They can be held by the same person on a one-person on-call, but they are still distinct hats.

  • Incident commander — decides what gets done next. Holds the timeline.
  • Communicator — runs the status page, the client comms, the internal channel. Posts an update at least every 30 minutes.
  • Investigator — looks at the data. Reports findings to the commander.

The commander does not investigate. The investigator does not communicate. The communicator does not decide.

Escalation and de-escalation

Escalation is information flow, not blame flow. The rule: no surprises. If leadership will hear about this incident from anywhere other than the commander, they hear about it from the commander first.

Severity                   | Who hears in 5 minutes        | Who hears in 30 minutes      | Status page
P0 — production down       | On-call group, Tech Lead, PO  | Leadership, affected clients | Automatic
P1 — partial degradation   | On-call group, Tech Lead      | PO, leadership at next sync  | Manual update
P2 — single-feature impact | On-call group                 | PO at next standup           | Internal note only
P3 — annoyance             | On-call group at next standup |                              |
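The severity matrix can also be held as data, so that a tool can check the "no surprises" rule against the clock. The sketch below is one possible encoding: `EscalationRule`, `ESCALATION`, and `should_have_heard` are hypothetical names, the audience strings are copied from the matrix, and placing the single P3 entry in the five-minute column is an assumption, since the source lists only one entry for that row.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class EscalationRule:
    description: str
    within_5_min: Tuple[str, ...]
    within_30_min: Tuple[str, ...]
    status_page: str  # empty string where the matrix lists nothing


ESCALATION: Dict[str, EscalationRule] = {
    "P0": EscalationRule("production down",
                         ("On-call group", "Tech Lead", "PO"),
                         ("Leadership", "affected clients"),
                         "Automatic"),
    "P1": EscalationRule("partial degradation",
                         ("On-call group", "Tech Lead"),
                         ("PO", "leadership at next sync"),
                         "Manual update"),
    "P2": EscalationRule("single-feature impact",
                         ("On-call group",),
                         ("PO at next standup",),
                         "Internal note only"),
    # Assumption: the lone P3 entry belongs in the first audience column.
    "P3": EscalationRule("annoyance",
                         ("On-call group at next standup",),
                         (),
                         ""),
}


def should_have_heard(severity: str, minutes_elapsed: float) -> Tuple[str, ...]:
    """Audiences whose notification deadline has already passed."""
    rule = ESCALATION[severity]
    heard: Tuple[str, ...] = ()
    if minutes_elapsed >= 5:
        heard += rule.within_5_min
    if minutes_elapsed >= 30:
        heard += rule.within_30_min
    return heard
```

For example, `should_have_heard("P0", 10)` returns the on-call group, Tech Lead, and PO; anyone in that tuple who has not yet heard from the commander is a surprise waiting to happen.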

De-escalation is as deliberate as escalation. The incident is not over until the commander stands the team down, the status page is updated to resolved, the timeline is archived, and someone has checked on the people who took the page. The cost of an unannounced de-escalation is a team that stays anxious into the next cycle.

The postmortem

Within five working days. Blameless in tone. Structural in output.
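The five-working-day deadline is simple date arithmetic. The sketch below assumes the clock starts on the day the incident is contained (the text does not say which day counts as day zero), treats Monday to Friday as working days, and ignores public holidays; `postmortem_due` is a hypothetical helper name.

```python
from datetime import date, timedelta


def postmortem_due(start: date, working_days: int = 5) -> date:
    """Latest date for the postmortem, counting weekdays after `start`.

    Assumptions: `start` itself is day zero, weekends are skipped,
    and public holidays are not considered.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


# An incident contained on Wednesday 5 June 2024 is due by Wednesday 12 June 2024.
print(postmortem_due(date(2024, 6, 5)))  # 2024-06-12
```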

The postmortem asks, in order:

  1. Timeline — what happened, when, who knew. Drawn from the commander's notes.
  2. Detection — when did the system know? When did a person know? The difference is the detection gap.
  3. Containment — how did we stop the bleeding? Were the runbooks adequate?
  4. Root cause — five whys, but with chain levels as the answer space. Not "why did the developer not see this?" but "why did the chain not catch it before it reached this developer?"
  5. What was missed — which level had the right opportunity to prevent this incident, and didn't?
  6. Structural fix — owned, dated, tracked. Not "we'll be more careful." A change to a checklist, a runbook, a test, a brief template, a CI step.
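The detection gap in step 2 is a subtraction over two timestamps from the timeline. A minimal worked example follows; the timestamps are made up for illustration and `detection_gap` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def detection_gap(system_knew: datetime, person_knew: datetime) -> timedelta:
    """Time between the system knowing and a person knowing.

    Positive: an alert or log line existed before any human noticed.
    Negative: a person noticed before the system did, which usually
    points to a missing runbook condition or alert.
    """
    return person_knew - system_knew


# Illustrative values only: the alert fired at 14:02, the on-call acknowledged at 14:19.
gap = detection_gap(datetime(2024, 6, 5, 14, 2), datetime(2024, 6, 5, 14, 19))
print(gap)  # 0:17:00
```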

A postmortem that produces an action item with no owner and no date is not a postmortem. It is a feeling that was written down.

What goes back into the chain

Every postmortem produces at least one chain-artifact change. Examples that hold:

  • A runbook gains a new condition.
  • The release-gate checklist gains a new item.
  • The Feature Brief template gains a new question that would have caught this.
  • The CI pipeline gains a new check.
  • A new ADR records the constraint that the incident revealed.

The model update absorbs the rest. Without the model update, the postmortem becomes a memorial.

Resolution gate — incident closed

Enough to know the chain learned.

Status page resolved. Timeline archived. Postmortem complete with at least one structural change owned and dated. The runbook or checklist that was missing now exists.
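The gate reads naturally as a checklist that either passes or names what is still open. The sketch below is one way to express it, assuming a hypothetical `Incident` record; the field names, `ActionItem`, and `gate_failures` are illustrative and not part of any existing tool. It also applies the earlier rule that an action item without an owner and a date does not count.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    structural: bool = False  # changes a checklist, runbook, test, brief template, or CI step

    def is_owned_and_dated(self) -> bool:
        return bool(self.owner) and self.due is not None


@dataclass
class Incident:
    # Hypothetical record; each flag mirrors one sentence of the gate.
    status_page_resolved: bool
    timeline_archived: bool
    postmortem_complete: bool
    missing_artifact_exists: bool  # the runbook or checklist that was missing now exists
    action_items: List[ActionItem] = field(default_factory=list)


def gate_failures(incident: Incident) -> List[str]:
    """Return the reasons the incident cannot be closed; empty means the gate passes."""
    failures: List[str] = []
    if not incident.status_page_resolved:
        failures.append("status page not resolved")
    if not incident.timeline_archived:
        failures.append("timeline not archived")
    if not incident.postmortem_complete:
        failures.append("postmortem not complete")
    if not incident.missing_artifact_exists:
        failures.append("missing runbook or checklist still does not exist")
    for item in incident.action_items:
        if not item.is_owned_and_dated():
            failures.append(f"action item without owner or date: {item.description}")
    if not any(a.structural and a.is_owned_and_dated() for a in incident.action_items):
        failures.append("no structural change that is owned and dated")
    return failures
```

Returning the list of failures, rather than a bare boolean, keeps the output usable as the agenda for whatever is still open.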


200apps · How We Work · NWIRE