Incidents and Postmortems

Contain before you diagnose. Escalation is information flow, not blame. The postmortem question is not who failed but which level of the chain was missing.

Events in this phase

The page arrives when monitoring fires. Contain first — disable the flag. Escalate — information flows up, not blame. Postmortem within 48 hours, calendar-locked.

The incident process

An incident is not a bug. A bug is a defect in the code. An incident is a failure actively affecting users that requires immediate response.

  • Detect: monitoring fires.
  • Contain: disable the feature flag. Minutes, not hours. Contain before you diagnose.
  • Communicate: the pre-written template goes to the client if the incident exceeds 15 minutes. Send it during the incident, not after resolution.
  • Resolve: root cause identified, fix tested in staging, deployed through the normal pipeline. Flag re-enabled only after staging confirms.
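
The steps above can be sketched in code. This is a minimal illustration, not the team's actual tooling: the `Incident` record, its field names, and the helper functions are all hypothetical. Only the 15-minute client-notice threshold and the "staging confirms before re-enabling" rule come from the process itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the Communicate step: notify once the incident exceeds 15 minutes.
CLIENT_NOTICE_THRESHOLD = timedelta(minutes=15)


@dataclass
class Incident:
    """Hypothetical record of one incident; field names are illustrative."""
    flag_name: str
    detected_at: datetime
    flag_enabled: bool = True
    client_notified: bool = False


def contain(incident: Incident) -> None:
    """Contain first: switch the feature flag off before any diagnosis."""
    incident.flag_enabled = False


def should_notify_client(incident: Incident, now: datetime) -> bool:
    """True once the incident has run longer than 15 minutes and no notice has gone out."""
    elapsed = now - incident.detected_at
    return (not incident.client_notified) and elapsed > CLIENT_NOTICE_THRESHOLD


def reenable_flag(incident: Incident, staging_confirmed: bool) -> None:
    """Resolve: the flag comes back on only after staging confirms the fix."""
    if not staging_confirmed:
        raise RuntimeError("fix not confirmed in staging; flag stays disabled")
    incident.flag_enabled = True


# Example: an incident detected at 09:00 and still open at 09:20.
start = datetime(2024, 1, 1, 9, 0)
incident = Incident(flag_name="new_checkout", detected_at=start)
contain(incident)
print(should_notify_client(incident, start + timedelta(minutes=20)))  # True
```

Keeping `contain` free of any diagnostic logic mirrors the ordering in the list: the flag goes off first, and everything else happens against a contained system.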

Escalation is information flow, not blame

When you escalate, you are saying: "This problem has exceeded my ability to resolve it within the expected timeframe."

Three principles:

  1. Speed over perfection: a 90%-accurate message in 5 minutes beats a perfect one in 30.
  2. Over-escalate, then stand down: de-escalating a false alarm is always cheaper than under-escalating a real incident.
  3. Escalation is a skill: teams must practice it, not discover it under pressure.

Escalation matrix by priority

P0

  • Notify leadership + on-call within 15 minutes.
  • War room open within 30 minutes.
  • Status updates every 30 minutes.
  • Client notified proactively within 1 hour.
  • Post-mortem required within 24 hours.
  • All other work stops.

P1

  • Notify engineering manager + PM within 1 hour.
  • Assemble team within 2 hours if unresolved.
  • Status updates every 2 hours.
  • Client notified if >10 users affected.
  • Post-mortem within 48 hours.

P2

  • Notify PM within 4 hours.
  • Daily status updates until resolved.
  • Reactive client communication (respond to tickets).
  • Post-mortem optional.

P3

  • Notify team lead at next standup.
  • Sprint-level tracking.
  • No client communication.
  • No post-mortem.
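
The matrix above can also be held as data, so that deadlines are computed rather than remembered. The sketch below is one possible encoding, under stated assumptions: the dictionary keys and the `deadlines` helper are hypothetical, and the windows are measured from the moment of detection, which the matrix does not specify. `None` marks entries with no fixed clock (P2's optional post-mortem, P3's "next standup").

```python
from datetime import datetime, timedelta
from typing import Optional

# Time windows copied from the matrix; None means "no fixed deadline".
ESCALATION_MATRIX = {
    "P0": {"notify": "leadership + on-call",
           "notify_within": timedelta(minutes=15),
           "status_every": timedelta(minutes=30),
           "postmortem_within": timedelta(hours=24)},
    "P1": {"notify": "engineering manager + PM",
           "notify_within": timedelta(hours=1),
           "status_every": timedelta(hours=2),
           "postmortem_within": timedelta(hours=48)},
    "P2": {"notify": "PM",
           "notify_within": timedelta(hours=4),
           "status_every": timedelta(days=1),
           "postmortem_within": None},   # post-mortem optional
    "P3": {"notify": "team lead",
           "notify_within": None,        # raised at next standup
           "status_every": None,         # sprint-level tracking
           "postmortem_within": None},   # no post-mortem
}


def deadlines(priority: str, detected_at: datetime) -> dict:
    """Return who to notify and the concrete deadlines for one incident.

    Assumption: every window starts at detection time.
    """
    row = ESCALATION_MATRIX[priority]

    def due(window: Optional[timedelta]) -> Optional[datetime]:
        return detected_at + window if window is not None else None

    return {
        "notify": row["notify"],
        "notify_by": due(row["notify_within"]),
        "postmortem_by": due(row["postmortem_within"]),
    }


# Example: a P0 detected at 09:00 must reach leadership by 09:15.
print(deadlines("P0", datetime(2024, 1, 1, 9, 0))["notify_by"])
```

Storing the matrix as a single table keeps the policy in one place; changing a window means editing one value rather than hunting through alerting code.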

The "no surprises" rule: leadership should never learn about a problem from a customer. If a customer knows before your manager does, the escalation process has failed. This doesn't mean every bug is escalated — it means any issue with customer-visible impact is communicated upward before customers start calling.

200apps · How We Work · NWIRE