part one · the first 48 hours
The First 48 Hours
The period between the flag being enabled and the first honest picture. What to watch, what to act on, and what to let settle.
On-call rotation is active for the 48 hours after the flag is enabled; that was a release-gate condition. Watching is loose during the first hour, sharper after, then settles into normal-cadence dashboard checks. No new ceremony: this is a heightened state of normal flow.
This is where meaning meets the world for the first time. The flag is enabled. The feature is live. Volume IV's machinery is now the watching apparatus — runbooks armed, SLOs baselined, prediction "before" numbers captured.
The first 48 hours are when the team has the most attention and the least data. The instinct is to act on every signal. The discipline is not reacting fast; it is not reacting incorrectly. Acting early is not a sign of control. Acting correctly is. Knowing which signals warrant action and which need time to stabilise is the difference between a team that contains problems and a team that creates new ones.
What to watch
The monitoring dashboards are the primary source of truth. Not support tickets — dashboards. Support tickets lag reality by hours. The specific SLIs defined in the ADRs are what the team watches: error rates, latency percentiles, queue depths. The SLO thresholds trigger action. The leading signals from Volume IV — adoption, completion, error encounter rate — tell the early story.
The first hour is the noisiest. People click things in unexpected orders, submit forms twice, navigate away mid-flow. Some of this produces errors that are not bugs — they are the normal shape of first contact. The question is not whether errors are occurring. It is whether the error rate is above the SLO threshold and trending up.
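As an illustration only, the "above threshold and trending up" question can be sketched as a comparison of two adjacent windows of error-rate samples. The function name, the per-minute sampling, the five-sample window, and the 2% threshold are all assumptions made for this sketch; the real SLIs and thresholds are the ones defined in the ADRs.

```python
from statistics import mean


def is_above_and_rising(error_rates, slo_threshold, window=5):
    """Return True when the most recent window of samples averages above the
    SLO threshold AND averages higher than the window before it.

    error_rates: error-rate samples, oldest first (assumed one per minute).
    slo_threshold: the SLO error-rate threshold, as a fraction (0.02 == 2%).
    window: samples per comparison window (an assumed value, not from the ADRs).
    """
    if len(error_rates) < 2 * window:
        return False  # too little data to call a trend; keep watching

    recent = mean(error_rates[-window:])
    previous = mean(error_rates[-2 * window:-window])
    return recent > slo_threshold and recent > previous


# First-contact noise: early spikes, but the recent window is under the threshold.
noisy_but_settling = [0.030, 0.010, 0.025, 0.012, 0.020,
                      0.011, 0.018, 0.009, 0.015, 0.010]

# A genuine problem: over the threshold and still climbing.
climbing = [0.010, 0.012, 0.015, 0.018, 0.020,
            0.022, 0.025, 0.027, 0.030, 0.034]

# Over the threshold but recovering: high, not rising.
recovering = [0.060, 0.055, 0.050, 0.050, 0.045,
              0.040, 0.035, 0.030, 0.030, 0.025]
```

With a threshold of 0.02, only the `climbing` series satisfies both halves of the question. The `recovering` series is still above the threshold, so it stays under watch even though this particular check returns False.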
When to act
Three conditions warrant immediate action. First, an SLO threshold crossed for more than 5 minutes: open the runbook and start from step one. Second, any data integrity concern: disable the flag immediately and investigate in staging. Third, any security-relevant behaviour: disable the flag, full stop. Everything else is logged, prioritised using the bug taxonomy, and addressed in normal flow.
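The three conditions above can be read as a small decision rule. The sketch below is one hypothetical encoding of it. The `Signal` and `Action` types and the `triage` function are invented for illustration. Checking the flag-disabling conditions before the SLO condition is also an assumption, since the text does not rank them.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DISABLE_FLAG = "disable the flag"
    OPEN_RUNBOOK = "open the runbook, start from step one"
    NORMAL_FLOW = "log, prioritise with the bug taxonomy, address in normal flow"


@dataclass
class Signal:
    """One observed signal during the watch period (hypothetical shape)."""
    minutes_over_slo: float = 0.0        # how long an SLO threshold has been crossed
    data_integrity_concern: bool = False
    security_relevant: bool = False


SLO_BREACH_MINUTES = 5  # from the text: crossed for MORE than 5 minutes


def triage(signal: Signal) -> Action:
    # Assumed ordering: conditions that disable the flag are checked first,
    # because disabling the flag is the stronger response.
    if signal.security_relevant or signal.data_integrity_concern:
        return Action.DISABLE_FLAG
    if signal.minutes_over_slo > SLO_BREACH_MINUTES:
        return Action.OPEN_RUNBOOK
    return Action.NORMAL_FLOW
```

For example, `triage(Signal(minutes_over_slo=6))` returns `Action.OPEN_RUNBOOK`, while a breach of exactly 5 minutes still falls into normal flow because the condition is "more than 5 minutes".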
By hour 48, the noisy first-contact patterns have settled. The team has a first honest picture — not the prediction check yet, but the data the prediction check will draw from.
Enough to know the feature is live and stable.
Dashboards are within SLO. No P0 incidents open. The "before" baseline is captured. Early usage patterns are visible.