Skip to content

SLA Monitoring & Breach Protocol

Two layers of alert. Early warning when threshold is approached. Breach protocol when threshold is crossed. The client is the first person notified, not the last.

Owners: Tech Lead, PO Phase it lives in: After We Build (Volume V) The corpus principle this enacts: A team that reports its own SLA breach before the client notices builds trust. A team that hopes the client didn't notice erodes it irreversibly.

Where it lives in the chain

The two layers

LayerWhenWho actsAction
Early warningSLA threshold approached (not crossed)On-call + POInvestigate the trend; communicate proactively if risk is real.
BreachSLA threshold crossedTech Lead + PO + CommunicatorIncident process: contain, communicate, resolve. Client notified first.

The breach protocol

  1. Detect — automated alert fires when the threshold crosses. Same SLI dashboards, additional alert layer.
  2. Contain — same four levers as any incident: flag off → deploy rollback → migration rollback → data correction.
  3. Communicate — the PO reaches out to the client proactively, within minutes. "We are seeing X. We are doing Y. We will tell you Z by W."
  4. Resolve — root cause identified, fix deployed, monitoring confirms recovery.
  5. Postmortem — same week. SLA breaches are always P-level enough to warrant the structural-fix discipline.
  6. Make right — credits, named recovery commitments, follow-up SLA review.

What good practice looks like

The dashboard shows latency creeping toward the SLA threshold. Early warning fires. The PO sends the client: "We're seeing increased latency on submissions; investigating. Will update in 30 minutes." The team finds the cause — a slow query — and rolls back the relevant deploy. The threshold never crosses. The client received transparency without panic; the team owned the moment before the client had to ask.

The JWT outage was an SLA breach. The team's response — the proactive client comms during the 44-minute resolution, the postmortem within 48 hours, the structural-fix commitment (environment-gated deployments, token compatibility smoke test) — is what limited the trust cost. The breach was contained by the protocol, not by the absence of the breach.

200apps · How We Work · NWIRE