part eight · runbooks & rollback

Runbooks & Rollback

Written before the incident, rehearsed in staging.

A runbook is a written procedure for a known operational situation. Authored before the situation occurs. Rehearsed in staging. Lives next to the service it covers. The on-call's only friend at 3am.

What a runbook is

text

RUNBOOK — Grading flow: high error rate on /api/submissions

Trigger
  Alert: submission_error_rate > 1% over 5 minutes
  Or:    queue_length > 1000

Severity assessment (decide first)
  - Is data integrity at risk?              → P0, follow this runbook
  - Is auth or security implicated?         → P0, disable flag, escalate
  - Are users blocked from work?            → P1
  - Is degradation only in features beyond  → P2
    the cycle's prediction?

Containment (first 5 minutes)
  1. Check feature flag panel:
     https://flags.200apps.example/grading.flow-v2
     If recently changed, REVERT to previous state.
  2. Check deploy timeline:
     https://deploys.200apps.example/grading
     If recent deploy, ROLLBACK with: ./scripts/rollback grading
  3. Check upstream LMS adapter health:
     https://status.lms-adapter.example
     If LMS adapter is down, expected behavior includes errors;
     escalate to LMS team.

Diagnosis (after containment)
  4. Check error log filter:
     `service:grading severity:error`
     Identify the most-frequent error class.
  5. Check the dashboard panel:
     "grading-flow latency p95 by endpoint"
  6. Trace one example request from the request ID in the error log:
     https://traces.200apps.example/?id=...

Communication
  - 5 min: ping #incident-grading with severity + first action
  - 15 min: post status page entry if user-facing
  - 30 min: communicator updates internal channel

Escalation
  P0: page Tech Lead immediately, then PO if user-impact >5 min
  P1: page Tech Lead if not resolved in 30 min
  P2: page Tech Lead if not resolved in 4 hours

Known false positives
  - Spike during 09:00 weekdays (cohort starts grading) — wait
    until 09:15 before action

Last reviewed: 2026-05-05 — Yossi
Last rehearsed in staging: 2026-04-28

A runbook is short, dry, and procedural. It tells the on-call exactly what to do.

The properties of a runbook that holds

Authored before the incident. A runbook written during the incident is documentation. A runbook written before is a procedure.
Rehearsed in staging. The runbook's procedure is run, end to end, against a staged failure. If steps are missing or wrong, the rehearsal surfaces it.
Read by an on-call who hasn't built the feature. They are the chain's audience. If they cannot follow it, it doesn't work.
Reviewed regularly. Every postmortem (Volume V Part 4) reads the relevant runbook. Outdated runbooks are repaired or deleted.
Lives next to the service. Not in a separate wiki. In the code repository or a runbook-store that the on-call has on their phone.

Four levels of rollback

The corpus pattern: rollback has four levels, applied in this order.

Level	Mechanism	Time	When to use
1 · Flag	Toggle the feature flag off	Seconds	New behavior is misbehaving
2 · Deploy	Roll back to the previous build	Minutes	Bug exists across both flag paths
3 · Migration	Reverse the schema migration	Hours (if author allowed it)	Data shape change is the cause
4 · Data	Correct affected rows	Hours+	Data has been corrupted

Each level is more expensive and more disruptive than the previous. The runbook tells the on-call which level to attempt and in what order.

The corpus's most important rollback discipline: plan for the level you might need before you ship. Migrations should be reversible by default. Data changes should be auditable. Flags should wrap behavior. I didn't think we'd need to roll back is not a postmortem-acceptable answer; that is what the runbook is for.

When a runbook doesn't exist

The on-call hits a situation the runbook doesn't cover. Three steps:

Contain first. Use the four levels above. Default to flag off if uncertain.
Document during. A real-time log of what was tried, what happened. Not for reading later — for the next person who hits this.
Author after. The postmortem produces a runbook entry for this situation, so it is covered next time.

The corpus rule: every incident produces at least one runbook delta. Either the existing runbook needed an update, or a new runbook needed to be written. No incident leaves the runbook unchanged.

Status page

A status page is the communicator's instrument during an incident. The corpus pattern:

Automatic for known severities — when alerts fire above threshold, a status entry is posted automatically with a default message.
Manual updates during the incident — every 30 minutes minimum at P0/P1.
Resolution post — when the incident is closed, the status page shows it as resolved with a one-paragraph postmortem summary.

A status page that is accurate and current builds trust. A status page that says all systems operational while customers are reporting issues destroys it.

What rollback discipline produces for the chain

The chain that takes rollback seriously ships more often. The chain that does not avoids shipping. The math: a deployable artifact with three rollback levels is one the team will release; an artifact with no rollback plan is one the team will sit on.

Part 9 — Observability →

✦ Why We Build

◐ Before We Build

◑ What We Build

● How We Build

◔ After We Build

◕ Did We Serve? (legacy)

Runbooks & Rollback

What a runbook is

The properties of a runbook that holds

Four levels of rollback

When a runbook doesn't exist

Status page

What rollback discipline produces for the chain

Runbooks & Rollback ​

What a runbook is ​

The properties of a runbook that holds ​

Four levels of rollback ​

When a runbook doesn't exist ​

Status page ​

What rollback discipline produces for the chain ​

Runbooks & Rollback

What a runbook is

The properties of a runbook that holds

Four levels of rollback

When a runbook doesn't exist

Status page

What rollback discipline produces for the chain