AI Feature Flags Cheatsheet: Rollout, Rollback, Observe
For: A CTO at a Series A B2B SaaS company who has just had a bad AI feature incident — a model update silently degraded outputs for a subset of tenants — and is now designing a proper feature flag system that actually accounts for AI-specific failure modes, not just on/off user targeting
Standard feature flag tools gate on user IDs and percentages. That's fine for a new settings page. It's dangerous for a model swap — because the failure mode for AI isn't "the button doesn't render," it's "outputs look right but are subtly wrong for 12% of tenants." This cheatsheet is what to wire up after that incident so it doesn't happen twice.
Core shift: the flag target is the inference request, not the user. And the kill switch needs a live quality signal, not a Slack message from support.
1. What a UI feature flag misses for AI
| Dimension | UI flag tool | What AI needs |
|---|---|---|
| Targeting unit | User / account | Inference request + tenant + input class |
| Rollout signal | % of users | % of requests, weighted by risk tier |
| Kill switch | Manual toggle | Auto-trip on confidence / drift / cost |
| Observability | Click events, errors | Output distribution, eval scores, latency, token cost |
| Rollback unit | Code version | Model version + prompt version + retrieval index version |
| Failure mode | Loud (500s, crashes) | Silent (plausible-but-wrong outputs) |
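The first row is the one that changes your code the most. A minimal sketch of request-level targeting, where `flags` stands in for whatever flag SDK you use and `variation` mirrors the common flag-client call shape; treat both as placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceRequest:
    request_id: str
    tenant_id: str
    user_id: str
    input_class: str  # e.g. "summarize", "extract", "freeform"
    risk_tier: int    # 0 = low risk; high-risk tenants ramp last

def resolve_model_version(flags, req: InferenceRequest) -> str:
    # The targeting context is the request, not the user: a ramp can
    # then mean "5% of low-risk summarize requests", not "5% of users".
    context = {
        "tenant_id": req.tenant_id,
        "input_class": req.input_class,
        "risk_tier": req.risk_tier,
    }
    return flags.variation("model-version", context, "model-stable")
```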
2. The four flag types every AI surface needs
- Model version flag — pins which model (gpt-4o-2024-08-06, claude-sonnet-4.5, internal-llm-v3) handles the request. Versioned, not boolean.
- Prompt/chain flag — selects the prompt template or agent graph. Decoupled from model flag so you can change one without the other.
- Retrieval flag — gates which index, embedding model, or reranker is live. Critical: a re-embedded index is a silent breaking change.
- Behavior flag — toggles guardrails, fallback policy, max tokens, temperature. The lever you actually want during an incident.
Keep them independent. When something breaks, you need to roll back one axis without re-deploying the others.
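In code, independent axes means four separate lookups rather than one blob. A sketch, again with a generic flag client and illustrative flag keys:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlagSet:
    """One resolved flag tuple per inference request. Each axis is its
    own flag, so each can be ramped or rolled back without the others."""
    model_version: str      # e.g. "claude-sonnet-4.5"
    prompt_version: str     # e.g. "summarize@a1b2c3d" (a git ref)
    retrieval_version: str  # index + embedding model, e.g. "docs-idx-v8"
    behavior: dict          # guardrails, fallback policy, max_tokens, temperature

def resolve_flags(flags, context) -> FlagSet:
    # Deliberately four lookups; a single combined "ai-config" flag
    # would force all four axes to roll back together.
    return FlagSet(
        model_version=flags.variation("model-version", context, "model-stable"),
        prompt_version=flags.variation("prompt-version", context, "prompt-v1"),
        retrieval_version=flags.variation("retrieval-version", context, "index-v1"),
        behavior=flags.variation("behavior-config", context, {"temperature": 0.2}),
    )
```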
3. Rollout strategy: a sane progression
| Stage | Traffic | Gate to next |
|---|---|---|
| Shadow | 100% of requests, response discarded | Eval scores ≥ baseline on offline + live shadow set |
| Internal | Employee tenants only | No regression on golden set after 48h |
| Canary | 1–5% of requests, low-risk tenants | Quality, latency, cost within thresholds |
| Ramp | 10% → 25% → 50% | Per-tenant drift checks pass at each step |
| Full | 100% with old version warm | 7-day soak before deprecating old |
Two non-obvious rules:
- Shadow first, always. A model that scores well on your eval set can still blow up on your live input distribution. Shadow mode catches the gap.
- Don't ramp uniformly. Bucket tenants by risk (data sensitivity, contract size, output-criticality). High-risk tenants get the new version last and with explicit opt-in.
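On the first rule, a minimal shadow-mode sketch: the stable version serves the user, the candidate sees the same live input in the background, and its output is logged but never returned. `call_model` and `log_shadow_pair` are placeholders for your inference and logging layers:

```python
import asyncio

async def handle_request(req, stable_model, candidate_model):
    # The stable version is what the user gets.
    live = await call_model(stable_model, req)

    # Fire-and-forget: shadow latency and shadow failures must never
    # touch the user-facing path.
    asyncio.create_task(shadow(req, candidate_model, live))
    return live

async def shadow(req, candidate_model, live_response):
    try:
        candidate = await call_model(candidate_model, req)
        await log_shadow_pair(req.request_id, live_response, candidate=candidate)
    except Exception as exc:
        # A shadow failure is a data point, not an incident.
        await log_shadow_pair(req.request_id, live_response, error=str(exc))
```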
4. Rollback triggers — what should auto-trip
Each trigger needs a threshold, a window, and a default action. Don't page a human first; flip the flag first, page second.
| Signal | Example threshold | Action |
|---|---|---|
| Mean output confidence (logprob or self-eval) | Drops >15% vs 7d baseline | Pause ramp, alert |
| Eval score on live golden set | Below floor for 2 consecutive windows | Auto-rollback |
| Output length / structure drift | JSON parse failure rate > 0.5% | Auto-rollback |
| Refusal / safety trigger rate | 2x baseline | Auto-rollback |
| Per-tenant variance | Any tenant >3σ from cohort | Exclude tenant, alert |
| p95 latency | >1.5x baseline | Pause ramp |
| Cost per request | >1.3x baseline | Pause ramp |
| User thumbs-down rate | >1.5x baseline | Alert + manual review |
The per-tenant trigger is the one most teams skip. Aggregate metrics look fine while one tenant's outputs go sideways — exactly the failure pattern that prompted this post.
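Every row above reduces to the same shape: a metric over a window, a threshold, and a default action. A sketch of the evaluator, where `read_metric`, `flip_flag`, and `page` are hypothetical hooks into your metrics store, flag plane, and pager, and `read_metric` is assumed to return the value already in the threshold's units (an absolute rate or a multiple of baseline):

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    PAUSE_RAMP = "pause_ramp"
    AUTO_ROLLBACK = "auto_rollback"
    ALERT = "alert"

@dataclass(frozen=True)
class Trigger:
    metric: str
    window_minutes: int
    threshold: float
    action: Action

TRIGGERS = [
    Trigger("json_parse_failure_rate", 15, 0.005, Action.AUTO_ROLLBACK),  # absolute rate
    Trigger("refusal_rate_vs_baseline", 30, 2.0, Action.AUTO_ROLLBACK),   # x baseline
    Trigger("p95_latency_vs_baseline", 15, 1.5, Action.PAUSE_RAMP),       # x baseline
    Trigger("cost_per_request_vs_baseline", 60, 1.3, Action.PAUSE_RAMP),  # x baseline
]

def evaluate(triggers, read_metric, flip_flag, page):
    for t in triggers:
        if read_metric(t.metric, t.window_minutes) > t.threshold:
            # Flip first, page second.
            if t.action is Action.AUTO_ROLLBACK:
                flip_flag("model-version", to="previous")
            elif t.action is Action.PAUSE_RAMP:
                flip_flag("model-version", freeze=True)
            page(f"{t.metric} tripped over {t.window_minutes}m window: {t.action.value}")
```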
5. Observability: the minimum log line
For every inference, persist:
```
{
  request_id, tenant_id, user_id,
  model_flag_version, prompt_flag_version,
  retrieval_flag_version, behavior_flag_version,
  input_hash, input_class,
  output, output_tokens, output_structure_valid,
  latency_ms, cost_usd,
  confidence_score, self_eval_score,
  guardrail_triggered, fallback_used,
  user_feedback (nullable)
}
```
Three queries you should be able to run in under a minute:
- Output quality by flag version, sliced by tenant, last 24h.
- Drift between current and previous version on the same input_hash (replay set).
- Tenants whose distribution shifted >X% after a flag change.
If your current stack can't answer these, your flag system is theater. Tools worth looking at: LaunchDarkly or Statsig for the flag plane, OpenTelemetry + ClickHouse or Honeycomb for the data plane, Langfuse or Arize for AI-specific traces and evals.
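A sketch of the first query, assuming the log line above lands in a ClickHouse table named `inference_logs` with a `timestamp` column added; the table, column names, and `clickhouse_connect` host are illustrative:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.internal")

# Output quality by flag version, sliced by tenant, last 24h.
QUALITY_BY_VERSION = """
SELECT
    model_flag_version,
    tenant_id,
    avg(self_eval_score)        AS avg_quality,
    avg(output_structure_valid) AS parse_rate,
    quantile(0.95)(latency_ms)  AS p95_latency_ms,
    sum(cost_usd) / count()     AS cost_per_request
FROM inference_logs
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY model_flag_version, tenant_id
ORDER BY avg_quality ASC
"""

rows = client.query(QUALITY_BY_VERSION).result_rows
```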
6. The rollback unit problem
"Roll back the model" is ambiguous. A real rollback needs to revert the tuple:
(model_version, prompt_version, retrieval_index_version, behavior_config)
- Keep the previous tuple warm for at least 7 days post-ramp.
- Store prompts in version control, not in a database row someone can edit.
- Treat the retrieval index as a versioned artifact. New embedding model = new index = new flag.
- Make rollback a single flag flip, not a redeploy. If it requires a PR, it won't happen in time.
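A sketch of what that looks like: an immutable registry of release tuples plus a pointer, where rollback is one pointer flip. Release names and the `flags.set` write call are illustrative, not a specific vendor API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable: a shipped release is a value, never edited
class ReleaseTuple:
    model_version: str
    prompt_version: str           # a git ref, not an editable DB row
    retrieval_index_version: str  # new embedding model => new index version
    behavior_config: str

RELEASES = {
    "r-0041": ReleaseTuple("gpt-4o-2024-08-06", "prompt@a1b2c3d", "docs-idx-v7", "behavior@v4"),
    "r-0042": ReleaseTuple("claude-sonnet-4.5", "prompt@e4f5a6b", "docs-idx-v8", "behavior@v4"),
}

def rollback(flags):
    # One flag flip: repoint the active release at the previous tuple.
    # No redeploy, no PR, no per-axis guesswork during an incident.
    previous = flags.variation("previous-release", None, "r-0041")
    flags.set("active-release", previous)
```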
7. Honest tradeoffs
What this approach is bad at:
- Cost. Shadow traffic doubles inference spend during evaluation windows. Budget for it.
- Complexity. Four flag axes means a combinatorial space. You need a registry and a policy for which combinations are valid.
- False rollbacks. Aggressive auto-triggers will flip on real-but-benign distribution shifts (a new customer with weird inputs). Tune for a week before trusting auto-rollback in production.
- Eval set rot. Your golden set goes stale. Schedule a quarterly review or it stops catching regressions.
Teams building serious AI surfaces — credit decisioning at Cashpo, drug interaction checks at HealthPotli — converge on this pattern not because it's elegant but because the cost of a silent regression in a regulated workflow is much higher than the engineering overhead. If you're designing this from scratch, our AI studio team has notes.
Frequently Asked Questions
Can I use LaunchDarkly or Statsig for AI feature flags?
Yes, as the control plane. They handle targeting, percentages, and the flag API well. What they don't do is evaluate output quality or trigger rollbacks from model signals — you wire that yourself by having your inference service read flags from them and report metrics to a separate observability stack (Langfuse, Arize, or homegrown on ClickHouse).
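A sketch of the control-plane half, assuming the current LaunchDarkly Python SDK shape (`Context.builder`, `variation`); if your SDK version differs, treat the call names as approximate:

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-..."))
ld = ldclient.get()

def model_for(req):
    # Control plane: the flag tool decides which version this request gets.
    ctx = (Context.builder(req.request_id)
           .kind("inference-request")  # custom context kind, not "user"
           .set("tenant_id", req.tenant_id)
           .set("input_class", req.input_class)
           .build())
    return ld.variation("model-version", ctx, "model-stable")

# Data plane: eval scores, parse rate, and cost go to the observability
# stack keyed by flag versions (section 5). The flag tool never sees them.
```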
What's the difference between a canary deployment for AI and for regular code?
Regular canary watches error rates and latency — both loud signals. An AI canary has to watch silent signals: output distribution drift, confidence drops, structural validity, and per-tenant variance. A model can return 200 OK with perfectly formed garbage, so HTTP-level health checks are insufficient.
How do I roll back a fine-tuned model without losing recent training data?
Treat model artifacts as immutable, versioned objects. The previous fine-tune stays addressable; rollback is a pointer change, not a retraining. Training data accumulated since the bad version is preserved separately and can be replayed against whichever base you re-fine-tune from next.
Should every AI feature have an auto-rollback trigger?
Every production AI feature should have at least one — structural validity (does the output parse?) is the cheap minimum. Quality-score-based auto-rollback is worth the effort for high-traffic or high-stakes features. For internal tools or low-volume features, alert-only is often enough.
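The cheap minimum, as a sketch; feed the boolean into the same trigger shape as section 4 (parse failure rate over a window, auto-rollback on breach):

```python
import json

def output_structure_valid(raw_output: str) -> bool:
    """The cheapest rollback signal: does the output parse at all?"""
    try:
        json.loads(raw_output)
        return True
    except json.JSONDecodeError:
        return False
```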
How long should we run shadow mode before promoting a new model?
Long enough to cover your input diversity — usually a full business cycle (one week minimum for most B2B SaaS, longer if you have weekly or monthly batch patterns). For specifics on your traffic profile and risk tolerance, contact CodeNicely for a personalized assessment.
Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.