
Gradual Rollout

Pilot, percentage ramp, full enablement.

Gradual rollout is the discipline of enabling a feature progressively, with observation between each step, instead of flipping a flag for everyone at once. The chain that does this catches problems while they are small. The chain that doesn't catches problems after they have been everyone's problem for an hour.

The shape of a rollout

The default sequence: Internal → 1 user → 5 users / 1 customer → 10% → 50% → 100%.

Each step has exit criteria — what must be true before moving to the next.

| Step | Exit criteria |
| --- | --- |
| Internal | Internal team uses the feature for one full work session. No errors. No negative surprises. |
| 1 user | Pilot user uses the feature for 1–2 days. Observed by PO + Designer. No critical issues. |
| 5 users / 1 customer | One full grading cycle for the customer. Support volume normal. SLO holds. |
| 10% | Two days at 10%. Error rate within SLO. Adoption metric trending up. |
| 50% | Two days at 50%. Same checks at scale. |
| 100% | Holds for the full first 48 hours (Volume V Part 1). |

The exit criteria are written before the rollout begins. They live in the release brief. Not invented during enablement.
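
The rule that criteria are fixed up front can be expressed as data. The following is a minimal sketch, not a description of any existing tooling: `RolloutStep`, `confirm`, and `advance` are hypothetical names, and the criteria strings are abbreviated from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutStep:
    """One rollout step with the exit criteria fixed in the release brief."""
    name: str
    exit_criteria: tuple[str, ...]                     # written before the rollout begins
    confirmed: set[str] = field(default_factory=set)   # criteria signed off so far

    def confirm(self, criterion: str) -> None:
        # Only criteria written up front can be confirmed; anything else
        # would be a criterion invented during enablement.
        if criterion not in self.exit_criteria:
            raise ValueError(f"{criterion!r} is not in the brief for step {self.name!r}")
        self.confirmed.add(criterion)

    def can_exit(self) -> bool:
        return all(c in self.confirmed for c in self.exit_criteria)


def advance(steps: list[RolloutStep], current: int) -> int:
    """Return the index of the step the rollout should be on next."""
    is_last = current == len(steps) - 1
    if is_last or not steps[current].can_exit():
        return current          # hold: criteria unmet, or already at the final step
    return current + 1


# Step names follow the table above; criteria wording is abbreviated (assumption).
STEPS = [
    RolloutStep("Internal", ("one full work session", "no errors", "no negative surprises")),
    RolloutStep("1 user", ("used for 1-2 days", "observed by PO + Designer", "no critical issues")),
    RolloutStep("5 users / 1 customer", ("one full grading cycle", "support volume normal", "SLO holds")),
    RolloutStep("10%", ("two days at 10%", "error rate within SLO", "adoption trending up")),
    RolloutStep("50%", ("two days at 50%", "same checks at scale")),
    RolloutStep("100%", ("holds for the first 48 hours",)),
]
```

Calling `advance(STEPS, 0)` keeps the rollout on the internal step until each of its criteria has been passed to `confirm`. An attempt to confirm a criterion that is not in the list raises an error, which mirrors the rule that nothing is added mid-rollout.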

Why pilots first

A pilot is a named person, observed in the field, using the feature for the activity it was designed for. Not a beta-test group. Not a focus session. The pilot is the smallest version of Volume V Part 2 — the prediction is checked at small scale.

The pilot's value:

  • Surfaces the obvious miss before it is everyone's miss.
  • Builds the customer trust that the team is being careful.
  • Produces the first signal reading at low risk.

A pilot of one named person who has been part of Discovery is a better first step than a percentage rollout. Percentage rollouts have anonymous signal. Named pilots have legible signal: "Gal hit the workaround again at J6, three times yesterday" is a story; "0.3% of users encountered an error" is a metric.

When percentage rollouts add value

After the pilot. Percentage rollouts test scale, not behavior. They surface issues that only appear under load — flaky third-party integrations, hot-spot DB queries, cache stampedes.

Percentage rollouts are observed via SLO dashboards, not field observation. The pilot already verified the human side; the percentage ramp verifies the system side.
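
The system-side check for a percentage step ("two days at 10%, error rate within SLO") might be sketched as below. This is an illustration under assumptions: `DailyWindow` and `step_holds` are hypothetical names, the counts would come from whatever feeds the SLO dashboard, and treating a day with no traffic as "not yet observed" is a choice of this sketch, not something the text specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyWindow:
    """Request and error counts for one day at the current percentage."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def step_holds(windows: list[DailyWindow], slo_max_error_rate: float,
               required_days: int = 2) -> bool:
    """True when the step ran for enough days and every day stayed within the SLO."""
    if len(windows) < required_days:
        return False    # not observed long enough yet
    # Assumption: a day without traffic proves nothing, so it does not count as holding.
    return all(w.requests > 0 and w.error_rate <= slo_max_error_rate for w in windows)
```

For example, with an assumed SLO of 0.5% errors, `step_holds([DailyWindow(10000, 20), DailyWindow(12000, 30)], 0.005)` holds, while a single day of data or one day at 0.8% does not.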

Targeting strategies

The flag's targeting (Part 3) implements the rollout. Common patterns:

  • By customer — enables one customer at a time. Useful when customers have different shapes.
  • By region — enables one region at a time. Useful for geographic latency or regulatory differences.
  • By percentage with sticky assignment — the same user always gets the same path during the rollout. This matters: flapping a user between flag-on and flag-off is the worst experience in the corpus for the named person.
  • By role — enables for one role across all customers (e.g., graders only, not students).

Sticky assignment is the default. The team should consciously choose otherwise.
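
One common way to get sticky percentage assignment is to hash the flag key and user ID into a fixed bucket, then compare the bucket with the current percentage. The sketch below combines that with the customer, region, and role filters listed above. It is an assumption about how targeting could be implemented, not the flag system described in Part 3; `User`, `Targeting`, `bucket`, `is_enabled`, and the flag key `grading-flow-v2` are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    user_id: str
    customer: str
    region: str
    role: str


@dataclass(frozen=True)
class Targeting:
    """Rollout rules for one flag; None means no restriction on that axis."""
    customers: Optional[frozenset] = None
    regions: Optional[frozenset] = None
    roles: Optional[frozenset] = None
    percentage: int = 100


def bucket(flag_key: str, user_id: str) -> int:
    """Stable bucket in [0, 100): the same flag and user always hash the same way."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_key: str, user: User, targeting: Targeting) -> bool:
    if targeting.customers is not None and user.customer not in targeting.customers:
        return False
    if targeting.regions is not None and user.region not in targeting.regions:
        return False
    if targeting.roles is not None and user.role not in targeting.roles:
        return False
    # Sticky: the bucket never changes, so raising the percentage only adds users.
    return bucket(flag_key, user.user_id) < targeting.percentage
```

Because a user's bucket is fixed, everyone enabled at 10% stays enabled at 50%, so no one flaps between paths as the ramp proceeds. Including the flag key in the hash means different flags select different subsets of users rather than always hitting the same people first.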

The release brief to the client

Before the rollout begins, the PO writes a short release brief to the client.

```text
RELEASE BRIEF — Grading Flow v2

Date:           2026-06-01
Owner:          Alex (PO)
For:            [Client lead], [Customer ops]

What's changing
  Hebrew name handling improvements in the grading flow.
  Specifically: graders no longer need to edit names manually
  in the spreadsheet workaround.

What to expect
  - Pilot phase (next 5 days): 1 grader at the flagship campus.
  - 10% rollout (days 6-7): randomly selected graders.
  - 50% (days 8-9), 100% (day 10).

What is not yet available
  - Bulk re-rendering of past grading reports (planned for cycle 18).

If something goes wrong
  - Disable is one switch. We will tell you within 30 minutes.
  - Status page: https://status.acme.example

Contact
  Alex (PO), Yossi (Tech Lead), on-call rotation: 200apps-grading
```

The release brief is shared before enablement, not after. Anxiety arrives when communication arrives late.

The CS handoff

The CS team learns about the change before the customer does. The handoff is a separate document.

```text
CS HANDOFF — Grading Flow v2

What's new
  - Native Hebrew name support; the spreadsheet workaround is no
    longer needed.

What might surface in tickets
  - Customers reporting that names look "different" — they look correct.
  - Customers asking about past reports (out of scope, see brief).
  - Edge case: rare unicode form falls back gracefully (see runbook).

Likely questions
  Q: Will my old grading reports be updated?
  A: No, the change applies to grading from [date] forward.
     Bulk re-rendering is planned for the next cycle.

  Q: I see a warning icon on a name. What does that mean?
  A: A rare unicode form was found. Grade can proceed normally; the
     name renders with a fallback character. Engineering is notified.

Escalation path
  L1 → L2 (QA + on-call dev) → L3 (Tech Lead + PO)
  Status page: https://status.acme.example
```

CS reading the handoff and asking questions before customers do is what makes the rollout calm.

What can stop a rollout

At any step, any of these stops the rollout:

  • The error rate exceeds SLO.
  • A pilot user reports a critical issue.
  • A support ticket pattern emerges that wasn't predicted.
  • The on-call is paged for the new flow.

The flag is disabled. The team investigates. The rollout resumes when the issue is closed and the runbook is updated.
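
The stop conditions and the resume rule above can be sketched as a small check. Everything here is illustrative: `Signals`, `stop_reasons`, and `Rollout` are hypothetical names, and in practice the inputs would come from the SLO dashboard, the pilot observation notes, the support queue, and the paging system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """A snapshot of what the team is watching at the current step."""
    error_rate: float
    slo_max_error_rate: float
    pilot_reported_critical_issue: bool
    unpredicted_ticket_pattern: bool
    on_call_paged_for_new_flow: bool


def stop_reasons(s: Signals) -> list[str]:
    """Return every stop condition that currently applies; an empty list means continue."""
    reasons = []
    if s.error_rate > s.slo_max_error_rate:
        reasons.append("error rate exceeds SLO")
    if s.pilot_reported_critical_issue:
        reasons.append("pilot user reported a critical issue")
    if s.unpredicted_ticket_pattern:
        reasons.append("unpredicted support ticket pattern")
    if s.on_call_paged_for_new_flow:
        reasons.append("on-call paged for the new flow")
    return reasons


@dataclass
class Rollout:
    flag_enabled: bool = True
    issue_closed: bool = False
    runbook_updated: bool = False

    def observe(self, s: Signals) -> list[str]:
        reasons = stop_reasons(s)
        if reasons:
            # Any single reason is enough: disable first, investigate second.
            self.flag_enabled = False
            self.issue_closed = False
            self.runbook_updated = False
        return reasons

    def resume(self) -> bool:
        # Resuming needs both: the issue closed and the runbook updated.
        if self.issue_closed and self.runbook_updated:
            self.flag_enabled = True
        return self.flag_enabled
```

Returning the list of reasons, rather than a bare boolean, gives the team the text they need for the status update and the runbook entry.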

The corpus pattern: stopping a rollout is normal. It is not failure. The team that has never paused a rollout is either a team with very simple features or a team that isn't watching closely.


200apps · How We Work · NWIRE