part eight · runbooks & rollback
Runbooks & Rollback
Written before the incident, rehearsed in staging.
A runbook is a written procedure for a known operational situation. Authored before the situation occurs. Rehearsed in staging. Lives next to the service it covers. The on-call's only friend at 3am.
What a runbook is
RUNBOOK — Grading flow: high error rate on /api/submissions
Trigger
Alert: submission_error_rate > 1% over 5 minutes
Or: queue_length > 1000
Severity assessment (decide first)
- Is data integrity at risk? → P0, follow this runbook
- Is auth or security implicated? → P0, disable flag, escalate
- Are users blocked from work? → P1
- Is degradation only in features beyond → P2
the cycle's prediction?
Containment (first 5 minutes)
1. Check feature flag panel:
https://flags.200apps.example/grading.flow-v2
If recently changed, REVERT to previous state.
2. Check deploy timeline:
https://deploys.200apps.example/grading
If recent deploy, ROLLBACK with: ./scripts/rollback grading
3. Check upstream LMS adapter health:
https://status.lms-adapter.example
If LMS adapter is down, expected behavior includes errors;
escalate to LMS team.
Diagnosis (after containment)
4. Check error log filter:
`service:grading severity:error`
Identify the most-frequent error class.
5. Check the dashboard panel:
"grading-flow latency p95 by endpoint"
6. Trace one example request from the request ID in the error log:
https://traces.200apps.example/?id=...
Communication
- 5 min: ping #incident-grading with severity + first action
- 15 min: post status page entry if user-facing
- 30 min: communicator updates internal channel
Escalation
P0: page Tech Lead immediately, then PO if user-impact >5 min
P1: page Tech Lead if not resolved in 30 min
P2: page Tech Lead if not resolved in 4 hours
Known false positives
- Spike during 09:00 weekdays (cohort starts grading) — wait
until 09:15 before action
Last reviewed: 2026-05-05 — Yossi
Last rehearsed in staging: 2026-04-28A runbook is short, dry, and procedural. It tells the on-call exactly what to do.
The properties of a runbook that holds
- Authored before the incident. A runbook written during the incident is documentation. A runbook written before is a procedure.
- Rehearsed in staging. The runbook's procedure is run, end to end, against a staged failure. If steps are missing or wrong, the rehearsal surfaces it.
- Read by an on-call who hasn't built the feature. They are the chain's audience. If they cannot follow it, it doesn't work.
- Reviewed regularly. Every postmortem (Volume V Part 4) reads the relevant runbook. Outdated runbooks are repaired or deleted.
- Lives next to the service. Not in a separate wiki. In the code repository or a runbook-store that the on-call has on their phone.
Four levels of rollback
The corpus pattern: rollback has four levels, applied in this order.
| Level | Mechanism | Time | When to use |
|---|---|---|---|
| 1 · Flag | Toggle the feature flag off | Seconds | New behavior is misbehaving |
| 2 · Deploy | Roll back to the previous build | Minutes | Bug exists across both flag paths |
| 3 · Migration | Reverse the schema migration | Hours (if author allowed it) | Data shape change is the cause |
| 4 · Data | Correct affected rows | Hours+ | Data has been corrupted |
Each level is more expensive and more disruptive than the previous. The runbook tells the on-call which level to attempt and in what order.
The corpus's most important rollback discipline: plan for the level you might need before you ship. Migrations should be reversible by default. Data changes should be auditable. Flags should wrap behavior. I didn't think we'd need to roll back is not a postmortem-acceptable answer; that is what the runbook is for.
When a runbook doesn't exist
The on-call hits a situation the runbook doesn't cover. Three steps:
- Contain first. Use the four levels above. Default to flag off if uncertain.
- Document during. A real-time log of what was tried, what happened. Not for reading later — for the next person who hits this.
- Author after. The postmortem produces a runbook entry for this situation, so it is covered next time.
The corpus rule: every incident produces at least one runbook delta. Either the existing runbook needed an update, or a new runbook needed to be written. No incident leaves the runbook unchanged.
Status page
A status page is the communicator's instrument during an incident. The corpus pattern:
- Automatic for known severities — when alerts fire above threshold, a status entry is posted automatically with a default message.
- Manual updates during the incident — every 30 minutes minimum at P0/P1.
- Resolution post — when the incident is closed, the status page shows it as resolved with a one-paragraph postmortem summary.
A status page that is accurate and current builds trust. A status page that says all systems operational while customers are reporting issues destroys it.
What rollback discipline produces for the chain
The chain that takes rollback seriously ships more often. The chain that does not avoids shipping. The math: a deployable artifact with three rollback levels is one the team will release; an artifact with no rollback plan is one the team will sit on.