AI Feature Flags Cheatsheet: Rollout, Rollback, Observe
For: A CTO at a Series A B2B SaaS company who has just had a bad AI feature incident — a model update silently degraded outputs for a subset of tenants — and is now designing a proper feature flag system that actually accounts for AI-specific failure modes, not just on/off user targeting
Standard feature flag tools gate on user IDs and percentages. That's fine for a new settings page. It's dangerous for a model swap — because the failure mode for AI isn't "the button doesn't render," it's "outputs look right but are subtly wrong for 12% of tenants." This cheatsheet is what to wire up after that incident so it doesn't happen twice.
Core shift: the flag target is the inference request, not the user. And the kill switch needs a live quality signal, not a Slack message from support.
1. What a UI feature flag misses for AI
| Dimension | UI flag tool | What AI needs |
|---|---|---|
| Targeting unit | User / account | Inference request + tenant + input class |
| Rollout signal | % of users | % of requests, weighted by risk tier |
| Kill switch | Manual toggle | Auto-trip on confidence / drift / cost |
| Observability | Click events, errors | Output distribution, eval scores, latency, token cost |
| Rollback unit | Code version | Model version + prompt version + retrieval index version |
| Failure mode | Loud (500s, crashes) | Silent (plausible-but-wrong outputs) |
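The first row is the one that changes your code the most. A minimal sketch of request-level targeting, where `flags` stands in for whatever flag SDK you use and `variation` mirrors the common flag-client call shape; treat both as placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceRequest:
    request_id: str
    tenant_id: str
    user_id: str
    input_class: str  # e.g. "summarize", "extract", "freeform"
    risk_tier: int    # 0 = low risk; high-risk tenants ramp last

def resolve_model_version(flags, req: InferenceRequest) -> str:
    # The targeting context is the request, not the user: a ramp can
    # then mean "5% of low-risk summarize requests", not "5% of users".
    context = {
        "tenant_id": req.tenant_id,
        "input_class": req.input_class,
        "risk_tier": req.risk_tier,
    }
    return flags.variation("model-version", context, "model-stable")
```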
2. The four flag types every AI surface needs
- Model version flag — pins which model (gpt-4o-2024-08-06, claude-sonnet-4.5, internal-llm-v3) handles the request. Versioned, not boolean.
- Prompt/chain flag — selects the prompt template or agent graph. Decoupled from model flag so you can change one without the other.
- Retrieval flag — gates which index, embedding model, or reranker is live. Critical: a re-embedded index is a silent breaking change.
- Behavior flag — toggles guardrails, fallback policy, max tokens, temperature. The lever you actually want during an incident.
Keep them independent. When something breaks, you need to roll back one axis without re-deploying the others.
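In code, independent axes means four separate lookups rather than one blob. A sketch, again with a generic flag client and illustrative flag keys:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlagSet:
    """One resolved flag tuple per inference request. Each axis is its
    own flag, so each can be ramped or rolled back without the others."""
    model_version: str      # e.g. "claude-sonnet-4.5"
    prompt_version: str     # e.g. "summarize@a1b2c3d" (a git ref)
    retrieval_version: str  # index + embedding model, e.g. "docs-idx-v8"
    behavior: dict          # guardrails, fallback policy, max_tokens, temperature

def resolve_flags(flags, context) -> FlagSet:
    # Deliberately four lookups; a single combined "ai-config" flag
    # would force all four axes to roll back together.
    return FlagSet(
        model_version=flags.variation("model-version", context, "model-stable"),
        prompt_version=flags.variation("prompt-version", context, "prompt-v1"),
        retrieval_version=flags.variation("retrieval-version", context, "index-v1"),
        behavior=flags.variation("behavior-config", context, {"temperature": 0.2}),
    )
```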
3. Rollout strategy: a sane progression
| Stage | Traffic | Gate to next |
|---|---|---|
| Shadow | 100% of requests, response discarded | Eval scores ≥ baseline on offline + live shadow set |
| Internal | Employee tenants only | No regression on golden set after 48h |
| Canary | 1–5% of requests, low-risk tenants | Quality, latency, cost within thresholds |
| Ramp | 10% → 25% → 50% | Per-tenant drift checks pass at each step |
| Full | 100% with old version warm | 7-day soak before deprecating old |
Two non-obvious rules:
- Shadow first, always. A model that scores well on your eval set can still blow up on your live input distribution. Shadow mode catches the gap.
- Don't ramp uniformly. Bucket tenants by risk (data sensitivity, contract size, output-criticality). High-risk tenants get the new version last and with explicit opt-in.
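On the first rule, a minimal shadow-mode sketch: the stable version serves the user, the candidate sees the same live input in the background, and its output is logged but never returned. `call_model` and `log_shadow_pair` are placeholders for your inference and logging layers:

```python
import asyncio

async def handle_request(req, stable_model, candidate_model):
    # The stable version is what the user gets.
    live = await call_model(stable_model, req)

    # Fire-and-forget: shadow latency and shadow failures must never
    # touch the user-facing path.
    asyncio.create_task(shadow(req, candidate_model, live))
    return live

async def shadow(req, candidate_model, live_response):
    try:
        candidate = await call_model(candidate_model, req)
        await log_shadow_pair(req.request_id, live_response, candidate=candidate)
    except Exception as exc:
        # A shadow failure is a data point, not an incident.
        await log_shadow_pair(req.request_id, live_response, error=str(exc))
```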
4. Rollback triggers — what should auto-trip
Each trigger needs a threshold, a window, and a default action. Don't page a human first; flip the flag first, page second.
| Signal | Example threshold | Action |
|---|---|---|
| Mean output confidence (logprob or self-eval) | Drops >15% vs 7d baseline | Pause ramp, alert |
| Eval score on live golden set | Below floor for 2 consecutive windows | Auto-rollback |
| Output length / structure drift | JSON parse failure rate > 0.5% | Auto-rollback |
| Refusal / safety trigger rate | 2x baseline | Auto-rollback |
| Per-tenant variance | Any tenant >3σ from cohort | Exclude tenant, alert |
| p95 latency | >1.5x baseline | Pause ramp |
| Cost per request | >1.3x baseline | Pause ramp |
| User thumbs-down rate | >1.5x baseline | Alert + manual review |
The per-tenant trigger is the one most teams skip. Aggregate metrics look fine while one tenant's outputs go sideways — exactly the failure pattern that prompted this post.
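Every row above reduces to the same shape: a metric over a window, a threshold, and a default action. A sketch of the evaluator, where `read_metric`, `flip_flag`, and `page` are hypothetical hooks into your metrics store, flag plane, and pager, and `read_metric` is assumed to return the value already in the threshold's units (an absolute rate or a multiple of baseline):

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    PAUSE_RAMP = "pause_ramp"
    AUTO_ROLLBACK = "auto_rollback"
    ALERT = "alert"

@dataclass(frozen=True)
class Trigger:
    metric: str
    window_minutes: int
    threshold: float
    action: Action

TRIGGERS = [
    Trigger("json_parse_failure_rate", 15, 0.005, Action.AUTO_ROLLBACK),  # absolute rate
    Trigger("refusal_rate_vs_baseline", 30, 2.0, Action.AUTO_ROLLBACK),   # x baseline
    Trigger("p95_latency_vs_baseline", 15, 1.5, Action.PAUSE_RAMP),       # x baseline
    Trigger("cost_per_request_vs_baseline", 60, 1.3, Action.PAUSE_RAMP),  # x baseline
]

def evaluate(triggers, read_metric, flip_flag, page):
    for t in triggers:
        if read_metric(t.metric, t.window_minutes) > t.threshold:
            # Flip first, page second.
            if t.action is Action.AUTO_ROLLBACK:
                flip_flag("model-version", to="previous")
            elif t.action is Action.PAUSE_RAMP:
                flip_flag("model-version", freeze=True)
            page(f"{t.metric} tripped over {t.window_minutes}m window: {t.action.value}")
```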
5. Observability: the minimum log line
For every inference, persist:
```
{
  request_id, tenant_id, user_id,
  model_flag_version, prompt_flag_version,
  retrieval_flag_version, behavior_flag_version,
  input_hash, input_class,
  output, output_tokens, output_structure_valid,
  latency_ms, cost_usd,
  confidence_score, self_eval_score,
  guardrail_triggered, fallback_used,
  user_feedback (nullable)
}
```
Three queries you should be able to run in under a minute:
- Output quality by flag version, sliced by tenant, last 24h.
- Drift between current and previous version on the same input_hash (replay set).
- Tenants whose distribution shifted >X% after a flag change.
If your current stack can't answer these, your flag system is theater. Tools worth looking at: LaunchDarkly or Statsig for the flag plane, OpenTelemetry + ClickHouse or Honeycomb for the data plane, Langfuse or Arize for AI-specific traces and evals.
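A sketch of the first query, assuming the log line above lands in a ClickHouse table named `inference_logs` with a `timestamp` column added; the table, column names, and `clickhouse_connect` host are illustrative:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.internal")

# Output quality by flag version, sliced by tenant, last 24h.
QUALITY_BY_VERSION = """
SELECT
    model_flag_version,
    tenant_id,
    avg(self_eval_score)        AS avg_quality,
    avg(output_structure_valid) AS parse_rate,
    quantile(0.95)(latency_ms)  AS p95_latency_ms,
    sum(cost_usd) / count()     AS cost_per_request
FROM inference_logs
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY model_flag_version, tenant_id
ORDER BY avg_quality ASC
"""

rows = client.query(QUALITY_BY_VERSION).result_rows
```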
6. The rollback unit problem
"Roll back the model" is ambiguous. A real rollback needs to revert the tuple:
(model_version, prompt_version, retrieval_index_version, behavior_config)
- Keep the previous tuple warm for at least 7 days post-ramp.
- Store prompts in version control, not in a database row someone can edit.
- Treat the retrieval index as a versioned artifact. New embedding model = new index = new flag.
- Make rollback a single flag flip, not a redeploy. If it requires a PR, it won't happen in time.
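A sketch of what that looks like: an immutable registry of release tuples plus a pointer, where rollback is one pointer flip. Release names and the `flags.set` write call are illustrative, not a specific vendor API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable: a shipped release is a value, never edited
class ReleaseTuple:
    model_version: str
    prompt_version: str           # a git ref, not an editable DB row
    retrieval_index_version: str  # new embedding model => new index version
    behavior_config: str

RELEASES = {
    "r-0041": ReleaseTuple("gpt-4o-2024-08-06", "prompt@a1b2c3d", "docs-idx-v7", "behavior@v4"),
    "r-0042": ReleaseTuple("claude-sonnet-4.5", "prompt@e4f5a6b", "docs-idx-v8", "behavior@v4"),
}

def rollback(flags):
    # One flag flip: repoint the active release at the previous tuple.
    # No redeploy, no PR, no per-axis guesswork during an incident.
    previous = flags.variation("previous-release", None, "r-0041")
    flags.set("active-release", previous)
```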
7. Honest tradeoffs
What this approach is bad at:
- Cost. Shadow traffic doubles inference spend during evaluation windows. Budget for it.
- Complexity. Four flag axes means a combinatorial space. You need a registry and a policy for which combinations are valid.
- False rollbacks. Aggressive auto-triggers will flip on real-but-benign distribution shifts (a new customer with weird inputs). Tune for a week before trusting auto-rollback in production.
- Eval set rot. Your golden set goes stale. Schedule a quarterly review or it stops catching regressions.
Teams building serious AI surfaces — credit decisioning at Cashpo, drug interaction checks at HealthPotli — converge on this pattern not because it's elegant but because the cost of a silent regression in a regulated workflow is much higher than the engineering overhead. If you're designing this from scratch, our AI studio team has notes.
Frequently Asked Questions
Can I use LaunchDarkly or Statsig for AI feature flags?
Yes, as the control plane. They handle targeting, percentages, and the flag API well. What they don't do is evaluate output quality or trigger rollbacks from model signals — you wire that yourself by having your inference service read flags from them and report metrics to a separate observability stack (Langfuse, Arize, or homegrown on ClickHouse).
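A sketch of the control-plane half, assuming the current LaunchDarkly Python SDK shape (`Context.builder`, `variation`); if your SDK version differs, treat the call names as approximate:

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-..."))
ld = ldclient.get()

def model_for(req):
    # Control plane: the flag tool decides which version this request gets.
    ctx = (Context.builder(req.request_id)
           .kind("inference-request")  # custom context kind, not "user"
           .set("tenant_id", req.tenant_id)
           .set("input_class", req.input_class)
           .build())
    return ld.variation("model-version", ctx, "model-stable")

# Data plane: eval scores, parse rate, and cost go to the observability
# stack keyed by flag versions (section 5). The flag tool never sees them.
```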
What's the difference between a canary deployment for AI and for regular code?
Regular canary watches error rates and latency — both loud signals. An AI canary has to watch silent signals: output distribution drift, confidence drops, structural validity, and per-tenant variance. A model can return 200 OK with perfectly formed garbage, so HTTP-level health checks are insufficient.
How do I roll back a fine-tuned model without losing recent training data?
Treat model artifacts as immutable, versioned objects. The previous fine-tune stays addressable; rollback is a pointer change, not a retraining. Training data accumulated since the bad version is preserved separately and can be replayed against whichever base you re-fine-tune from next.
Should every AI feature have an auto-rollback trigger?
Every production AI feature should have at least one — structural validity (does the output parse?) is the cheap minimum. Quality-score-based auto-rollback is worth the effort for high-traffic or high-stakes features. For internal tools or low-volume features, alert-only is often enough.
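The cheap minimum, as a sketch; feed the boolean into the same trigger shape as section 4 (parse failure rate over a window, auto-rollback on breach):

```python
import json

def output_structure_valid(raw_output: str) -> bool:
    """The cheapest rollback signal: does the output parse at all?"""
    try:
        json.loads(raw_output)
        return True
    except json.JSONDecodeError:
        return False
```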
How long should we run shadow mode before promoting a new model?
Long enough to cover your input diversity — usually a full business cycle (one week minimum for most B2B SaaS, longer if you have weekly or monthly batch patterns). For specifics on your traffic profile and risk tolerance, contact CodeNicely for a personalized assessment.
Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.