Pipeline & Operations · master area
Feature Flag Platform
Runtime toggles, targeting rules, audit trail, SDK. Default-off when the platform is unreachable — silent reachability failure must not silently enable behaviour.
Owners: Tech Lead, DevOps Phase it lives in: How We Build (Volume IV) The corpus principle this enacts: Rollback possible is not. A confirmed time is.
Where it lives in the chain
- How We Build · Feature Flags — the canon
- How We Build · Gradual rollout — how the platform supports staged release
What the platform provides
- Runtime toggle — flip a flag and have all consumers respond within seconds. Not minutes. Not after restart.
- Targeting — by user, segment, percentage, environment. "On for 10% of users in pilot region; everyone else default-off."
- Audit trail — every flag change recorded with who, when, why. The postmortem reads the audit when an incident traces to a flag flip.
- SDK — typed, with sensible defaults. Default-off when the SDK can't reach the platform. The wallet bug would have shipped a lot worse if the flag was default-on under reachability failure.
- Approval gates for high-risk flags — production-on requires two approvers; staging-on requires one.
How to do this
- Choose vs build — buy unless the platform itself is your product. Building flag infrastructure is a known sinkhole that pays back at scale most teams never reach.
- One source of truth — the flag definition lives in source control alongside the code that reads it. The platform's UI mirrors the file; the file is authoritative. Diff-on-deploy catches drift.
- Targeting rules are versioned — not edited in the UI without a record. Production targeting changed in the UI without a PR is shadow deployment.
What good practice looks like
A team uses the flag platform for gradual rollout — 1% → 5% → 25% → 100% — watching the dashboards between each step. The first 1% is the most expensive observation the team will ever make; it surfaces issues at low blast radius. The platform makes that observation cheap and reversible.
The wallet bug shipped to staging behind a flag. That is what flags are for. The JWT outage shipped to production simultaneously across all environments because the change was not flag-wrapped. The lesson: anything that can lock out users deserves the flag.
Related crafts
- Feature Flag Implementation — the code half
- Flag Lifecycle — birth → cleanup