Pipeline & Operations · master area
Monitoring & Alerting
Dashboards. SLO thresholds. Alert routing. Every alert has a runbook link. Every alert has a named on-call. Every alert is testable, not aspirational.
Owners: Tech Lead, DevOps, On-call
Phase it lives in: How We Build → After We Build
The corpus principle this enacts: The dashboards are the primary source of truth, not support tickets.
Where it lives in the chain
- How We Build · Testing · System signals — where SLIs are designed
- After We Build · The First 48 Hours — where monitoring is watched after release
How to do this
The discipline is specificity, not volume:
- Every alert has a runbook link. If on-call gets paged and no runbook exists, the runbook is the first artefact produced — before fixing the incident.
- Every alert has an SLO threshold. "p95 latency above 800 ms for 5 minutes", not "latency is high."
- Every alert has an owner. The team that wrote the SLO is the team that gets paged.
- Every dashboard has named SLIs. Adoption rate, completion rate, error encounter rate, p95 latency, queue depth. Named in the brief, baselined before release.
- Alerts route by priority. P0 → pager. P1 → on-call channel. P2 → ticket. P3 → daily digest. If every alert has the same priority, none of them has priority.
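
The rules in this list can be sketched as a small check that refuses to route any alert that lacks a runbook link, an owner, or a concrete threshold window. This is a minimal sketch in Python, not the team's actual tooling: `AlertRule`, `validate`, `route`, the team name, and the runbook URL are hypothetical. Only the P0 to P3 destinations and the 800 ms / 5 minute example come from the list itself.

```python
from dataclasses import dataclass

# Priority -> destination, as listed above. The strings are labels only;
# a real setup would map them to whatever paging/chat/ticket tools are in use.
ROUTES = {
    "P0": "pager",
    "P1": "on-call channel",
    "P2": "ticket",
    "P3": "daily digest",
}


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition; field names are assumptions."""
    name: str
    sli: str              # e.g. "p95 latency (ms)"
    threshold: float      # value the SLI must exceed
    window_minutes: int   # how long the breach must last
    priority: str         # "P0".."P3"
    owner: str            # team that wrote the SLO and gets paged
    runbook_url: str


def validate(rule: AlertRule) -> list[str]:
    """Return a list of problems; an empty list means the rule is routable."""
    problems = []
    if not rule.runbook_url.strip():
        problems.append("missing runbook link")
    if not rule.owner.strip():
        problems.append("missing owner")
    if rule.window_minutes <= 0:
        problems.append("window must be a positive number of minutes")
    if rule.priority not in ROUTES:
        problems.append(f"unknown priority {rule.priority!r}")
    return problems


def route(rule: AlertRule) -> str:
    """Return the destination for a valid rule, or raise if it is incomplete."""
    problems = validate(rule)
    if problems:
        raise ValueError(f"{rule.name}: " + "; ".join(problems))
    return ROUTES[rule.priority]


# Example rule using the threshold quoted in the list (names are made up).
checkout_latency = AlertRule(
    name="checkout-latency",
    sli="p95 latency (ms)",
    threshold=800,
    window_minutes=5,
    priority="P0",
    owner="payments-team",
    runbook_url="https://runbooks.example.com/checkout-latency",
)
```

Running `validate` over every rule in a CI step is one way to keep an alert without a runbook or owner from ever reaching production routing.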
What good practice looks like
The team's monitoring has a known shape:
- A dashboard per Feature — read by the PO during the first 48 hours, then weekly.
- A dashboard per Service — read by the Tech Lead, watched by on-call.
- A dashboard at portfolio level — DORA + VRI + SLA status. Read at quarterly review.
Alerts are tested in staging before they reach production routing. An alert that fires once a quarter, untested, is an alert nobody trusts. The team simulates the failure that would trigger it; the alert fires; the runbook works.
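
A staging test of this kind can be sketched as: generate synthetic degraded traffic, compute the SLI per minute, and assert that the alert condition becomes true. The sketch below is a minimal illustration in Python under stated assumptions: the function names, the 100-requests-per-minute shape, and the latency values are all hypothetical, and the nearest-rank percentile is one of several common p95 definitions. A real test would inject the failure into staging and observe the actual alerting system.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def sustained_breach(per_minute_values, threshold, window):
    """True if `window` consecutive per-minute values all exceed `threshold`."""
    run = 0
    for value in per_minute_values:
        run = run + 1 if value > threshold else 0
        if run >= window:
            return True
    return False


def simulate_minutes(healthy_minutes, degraded_minutes):
    """Synthetic per-minute p95 series: healthy traffic, then a simulated failure."""
    healthy = [200] * 94 + [600] * 6      # mostly fast requests, a few slow ones
    degraded = [200] * 90 + [1200] * 10   # simulated failure: 10% very slow
    return ([p95(healthy)] * healthy_minutes
            + [p95(degraded)] * degraded_minutes)


# Five degraded minutes should trip an "above 800 ms for 5 minutes" rule;
# four should not, because the breach has not lasted the full window.
fires = sustained_breach(simulate_minutes(10, 5), threshold=800, window=5)
stays_quiet = not sustained_breach(simulate_minutes(10, 4), threshold=800, window=5)
```

The second case matters as much as the first: a test that only checks that the alert fires does not show that it stays quiet for a shorter blip.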
The wallet bug surfaced first in staging dashboards, not in alerts — the threshold was wider than the bug's variance. The lesson: leading signals at lower thresholds catch what SLO alerts miss.
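
The lesson can be illustrated with two thresholds on the same SLI: a lower one that only marks the dashboard, and the SLO threshold that pages. This is a sketch in Python with hypothetical numbers (600 ms leading, 800 ms SLO) and a hypothetical `classify` function; it does not reproduce the wallet bug's actual data.

```python
def classify(value, leading_threshold, slo_threshold):
    """Classify one SLI reading against a leading threshold and an SLO threshold."""
    if leading_threshold >= slo_threshold:
        raise ValueError("leading threshold must sit below the SLO threshold")
    if value > slo_threshold:
        return "page"
    if value > leading_threshold:
        return "dashboard warning"
    return "ok"


# A regression that moves p95 from 500 ms to 700 ms never crosses the
# 800 ms SLO threshold, so no page is sent, but the leading signal shows it.
before = classify(500, leading_threshold=600, slo_threshold=800)
after = classify(700, leading_threshold=600, slo_threshold=800)
```

Here `before` is "ok" and `after` is "dashboard warning": the SLO alert alone would have reported nothing in either case.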
Related crafts
- Observability — what makes alerts diagnosable
- Runbooks — what each alert links to
- The First 48 Hours — when monitoring matters most