Ongoing Operations & Client · master area
SLA Monitoring & Breach Protocol
Two layers of alert. Early warning when threshold is approached. Breach protocol when threshold is crossed. The client is the first person notified, not the last.
Owners: Tech Lead, PO Phase it lives in: After We Build (Volume V) The corpus principle this enacts: A team that reports its own SLA breach before the client notices builds trust. A team that hopes the client didn't notice erodes it irreversibly.
Where it lives in the chain
The two layers
| Layer | When | Who acts | Action |
|---|---|---|---|
| Early warning | SLA threshold approached (not crossed) | On-call + PO | Investigate the trend; communicate proactively if risk is real. |
| Breach | SLA threshold crossed | Tech Lead + PO + Communicator | Incident process: contain, communicate, resolve. Client notified first. |
The breach protocol
- Detect — automated alert fires when the threshold crosses. Same SLI dashboards, additional alert layer.
- Contain — same four levers as any incident: flag off → deploy rollback → migration rollback → data correction.
- Communicate — the PO reaches out to the client proactively, within minutes. "We are seeing X. We are doing Y. We will tell you Z by W."
- Resolve — root cause identified, fix deployed, monitoring confirms recovery.
- Postmortem — same week. SLA breaches are always P-level enough to warrant the structural-fix discipline.
- Make right — credits, named recovery commitments, follow-up SLA review.
What good practice looks like
The dashboard shows latency creeping toward the SLA threshold. Early warning fires. The PO sends the client: "We're seeing increased latency on submissions; investigating. Will update in 30 minutes." The team finds the cause — a slow query — and rolls back the relevant deploy. The threshold never crosses. The client received transparency without panic; the team owned the moment before the client had to ask.
The JWT outage was an SLA breach. The team's response — the proactive client comms during the 44-minute resolution, the postmortem within 48 hours, the structural-fix commitment (environment-gated deployments, token compatibility smoke test) — is what limited the trust cost. The breach was contained by the protocol, not by the absence of the breach.