Monitoring & Alerting

Dashboards. SLO thresholds. Alert routing. Every alert has a runbook link. Every alert has a named on-call. Every alert is testable, not aspirational.

Owners: Tech Lead, DevOps, On-call
Phase it lives in: How We Build → After We Build
The corpus principle this enacts: The dashboards are the primary source of truth, not support tickets.

How to do this

The discipline is specificity, not volume:

  1. Every alert has a runbook link. If on-call gets paged and no runbook exists, the runbook is the first artefact produced — before fixing the incident.
  2. Every alert has an SLO threshold. "Latency above 800ms p95 for 5 minutes" — not "latency is high."
  3. Every alert has an owner. The team that wrote the SLO is the team that gets paged.
  4. Every dashboard has named SLIs. Adoption rate, completion rate, error encounter rate, p95 latency, queue depth. Named in the brief, baselined before release.
  5. Alerts route by priority. P0 → pager. P1 → on-call channel. P2 → ticket. P3 → daily digest. If every alert carries the same priority, no alert has priority.
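
One way to make the five rules above checkable is to keep each alert as a small record and reject any record that lacks a runbook, owner, threshold, or known priority. The following is a minimal sketch under stated assumptions, not a prescribed implementation: `AlertRule`, `validate`, `route`, the team name `payments-team`, and the runbook URL are all hypothetical names chosen for illustration. Only the 800ms / p95 / 5 minutes figures and the P0 to P3 routing come from the list.

```python
from dataclasses import dataclass

# Priority routing as listed in rule 5; the destination strings are labels only.
ROUTES = {"P0": "pager", "P1": "on-call channel", "P2": "ticket", "P3": "daily digest"}


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert record; field names are assumptions for this sketch."""
    name: str
    sli: str             # named SLI, e.g. "p95_latency_ms"
    threshold: float     # SLO threshold, in the SLI's unit
    window_minutes: int  # how long the breach must persist
    owner: str           # team that wrote the SLO and gets paged
    runbook_url: str     # link on-call follows when paged
    priority: str        # one of the keys in ROUTES


def validate(rule: AlertRule) -> list[str]:
    """Return the problems found; an empty list means the rule is specific enough."""
    problems = []
    if not rule.runbook_url.strip():
        problems.append("missing runbook link")
    if not rule.owner.strip():
        problems.append("missing owner")
    if rule.threshold <= 0 or rule.window_minutes <= 0:
        problems.append("threshold and window must be positive")
    if rule.priority not in ROUTES:
        problems.append(f"unknown priority {rule.priority!r}")
    return problems


def route(rule: AlertRule) -> str:
    """Map the rule's priority to its destination."""
    return ROUTES[rule.priority]


# Hypothetical rule mirroring "latency above 800ms p95 for 5 minutes".
checkout_latency = AlertRule(
    name="checkout-latency",
    sli="p95_latency_ms",
    threshold=800,
    window_minutes=5,
    owner="payments-team",
    runbook_url="https://runbooks.example.com/checkout-latency",
    priority="P0",
)

problems = validate(checkout_latency)   # empty list for a well-formed rule
destination = route(checkout_latency)   # looked up in ROUTES
```

A check like `validate` could run in review or CI, so an alert without a runbook or owner is caught before it ever pages anyone.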

What good practice looks like

The team's monitoring has a known shape:

  • A dashboard per Feature — read by the PO during the first 48 hours, then weekly.
  • A dashboard per Service — read by the Tech Lead, watched by on-call.
  • A dashboard at portfolio level — DORA + VRI + SLA status. Read at quarterly review.

Alerts are tested in staging before they reach production routing. An alert that fires once a quarter, untested, is an alert nobody trusts. The team simulates the failure that would trigger it; the alert fires; the runbook works.
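
The staging test described above can be sketched as a simulated failure fed into the alert condition. This is an illustrative sketch only: it assumes one p95 sample per minute, and `fires`, `simulate_latency_failure`, and the 300ms / 1200ms sample values are hypothetical. A real test would inject the failure into the staging system and watch the actual alerting pipeline.

```python
def fires(p95_per_minute: list[float], threshold_ms: float, window_minutes: int) -> bool:
    """True if p95 stayed above the threshold for `window_minutes` consecutive samples."""
    streak = 0
    for value in p95_per_minute:
        # Extend the streak on a breach, reset it on any healthy sample.
        streak = streak + 1 if value > threshold_ms else 0
        if streak >= window_minutes:
            return True
    return False


def simulate_latency_failure(healthy_minutes: int, degraded_minutes: int) -> list[float]:
    """Synthetic series: healthy traffic followed by a sustained slowdown (made-up values)."""
    return [300.0] * healthy_minutes + [1200.0] * degraded_minutes


# The simulated failure should trigger the alert...
sustained = simulate_latency_failure(healthy_minutes=10, degraded_minutes=5)
alert_fired = fires(sustained, threshold_ms=800, window_minutes=5)

# ...while a shorter blip should not.
blip = simulate_latency_failure(healthy_minutes=10, degraded_minutes=4)
blip_fired = fires(blip, threshold_ms=800, window_minutes=5)
```

Testing both directions matters: an alert that fires on the simulated failure but also on a short blip will be as distrusted as one that never fires.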

The wallet bug surfaced first in staging dashboards, not in alerts: the alert threshold was wider than the deviation the bug produced, so the bug stayed under it. The lesson: leading signals at lower thresholds catch what SLO alerts miss.
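
A leading signal can be sketched as a second, lower threshold evaluated alongside the SLO threshold. The code below is an assumption-laden illustration: `classify` and the 0.75 fraction are hypothetical and are not taken from the wallet incident. The right fraction depends on the SLI's normal variation.

```python
def classify(value: float, slo_threshold: float, leading_fraction: float = 0.75) -> str:
    """Classify an SLI reading against the SLO threshold and a lower leading threshold.

    `leading_fraction` is an assumed tuning knob: the leading threshold sits at
    that fraction of the SLO threshold.
    """
    if not 0 < leading_fraction < 1:
        raise ValueError("leading_fraction must be between 0 and 1")
    if value > slo_threshold:
        return "alert"
    if value > slo_threshold * leading_fraction:
        return "leading-signal"
    return "ok"


# With an 800ms SLO threshold, the leading threshold here is 600ms.
status_normal = classify(400, slo_threshold=800)
status_drift = classify(650, slo_threshold=800)
status_breach = classify(900, slo_threshold=800)
```

A leading signal would typically route at a lower priority than the SLO alert, for example to a ticket or the daily digest, so it informs without paging.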

200apps · How We Work · NWIRE