May 14, 2026 • 6 min read

AI Feature Flags Cheatsheet: Rollout, Rollback, Observe

For: A CTO at a Series A B2B SaaS company who has just had a bad AI feature incident — a model update silently degraded outputs for a subset of tenants — and is now designing a proper feature flag system that actually accounts for AI-specific failure modes, not just on/off user targeting

Standard feature flag tools gate on user IDs and percentages. That's fine for a new settings page. It's dangerous for a model swap — because the failure mode for AI isn't "the button doesn't render," it's "outputs look right but are subtly wrong for 12% of tenants." This cheatsheet is what to wire up after that incident so it doesn't happen twice.

Core shift: the flag target is the inference request, not the user. And the kill switch needs a live quality signal, not a Slack message from support.
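That shift can be sketched in a few lines. The names below (`InferenceFlagContext`, `flag_bucket`, the risk-tier weighting) are illustrative, not a real SDK — the point is that the bucketing key is the request, and risk tier shapes the effective rollout percentage:

```python
import hashlib
from dataclasses import dataclass

# Hypothetical targeting context: the unit is the inference request,
# carrying tenant and input class, not just a user ID.
@dataclass(frozen=True)
class InferenceFlagContext:
    request_id: str
    tenant_id: str
    input_class: str   # e.g. "short_query", "long_doc", "structured"
    risk_tier: int     # 0 = lowest risk; higher tiers ramp slower

def flag_bucket(ctx: InferenceFlagContext, salt: str = "model-v2") -> float:
    """Deterministic 0..1 bucket per request, so replays hash identically."""
    h = hashlib.sha256(f"{salt}:{ctx.request_id}".encode()).hexdigest()
    return int(h[:8], 16) / 0xFFFFFFFF

def use_new_model(ctx: InferenceFlagContext, ramp_pct: float) -> bool:
    # Risk-weighted rollout: higher-risk tiers see a reduced ramp rate.
    effective = ramp_pct / (1 + ctx.risk_tier)
    return flag_bucket(ctx) < effective
```

Deterministic hashing matters here: the same request replayed against the old and new versions lands in the same bucket, which is what makes the drift queries in section 5 possible.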

1. What a UI feature flag misses for AI

| Dimension | UI flag tool | What AI needs |
|---|---|---|
| Targeting unit | User / account | Inference request + tenant + input class |
| Rollout signal | % of users | % of requests, weighted by risk tier |
| Kill switch | Manual toggle | Auto-trip on confidence / drift / cost |
| Observability | Click events, errors | Output distribution, eval scores, latency, token cost |
| Rollback unit | Code version | Model version + prompt version + retrieval index version |
| Failure mode | Loud (500s, crashes) | Silent (plausible-but-wrong outputs) |

2. The four flag types every AI surface needs

  1. Model flag — which model version (or fine-tune) serves the request.
  2. Prompt flag — which prompt template version wraps the input.
  3. Retrieval flag — which retrieval index version feeds context.
  4. Behavior flag — runtime config: guardrails, output schema, fallbacks.

Keep them independent. When something breaks, you need to roll back one axis without re-deploying the others.
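Independence is easiest to enforce if the four axes live in one immutable record and a rollback touches exactly one field. A minimal sketch (version strings are made up):

```python
from dataclasses import dataclass, replace

# Four independent flag axes; rolling one back must not touch the others.
@dataclass(frozen=True)
class AIFlagSet:
    model_version: str            # which model weights serve the request
    prompt_version: str           # which prompt template
    retrieval_index_version: str  # which retrieval / vector index
    behavior_config: str          # guardrails, output schema, fallbacks

current = AIFlagSet("model-2026-04", "prompt-v12", "idx-2026-05-01", "strict-json")

# Roll back one axis only: dataclasses.replace leaves the rest pinned.
rolled_back = replace(current, model_version="model-2026-03")
```

Because the record is frozen, a partial rollback is always an explicit new value, never an in-place mutation that drifts out of sync with what's logged.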

3. Rollout strategy: a sane progression

| Stage | Traffic | Gate to next |
|---|---|---|
| Shadow | 100% of requests, response discarded | Eval scores ≥ baseline on offline + live shadow set |
| Internal | Employee tenants only | No regression on golden set after 48h |
| Canary | 1–5% of requests, low-risk tenants | Quality, latency, cost within thresholds |
| Ramp | 10% → 25% → 50% | Per-tenant drift checks pass at each step |
| Full | 100% with old version warm | 7-day soak before deprecating old |
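The progression is worth encoding rather than tracking in someone's head: a stage only advances when its gate passes, and never skips. A sketch (stage names from the table; the function is illustrative):

```python
# Ordered rollout stages from the table above. A stage holds until its
# gate passes; there is no path that skips a stage.
STAGES = ["shadow", "internal", "canary", "ramp", "full"]

def next_stage(current: str, gates_passed: bool) -> str:
    """Return the stage to run next; hold on gate failure or at full."""
    i = STAGES.index(current)
    if not gates_passed or i == len(STAGES) - 1:
        return current
    return STAGES[i + 1]
```

In practice the `gates_passed` input is computed from the trigger table in section 4, not decided by hand.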

Two non-obvious rules:

  1. Shadow runs at 100% of traffic, not a sample — you want the full input distribution before any user sees an output.
  2. "Full" is not the end. Keep the old version warm through a 7-day soak before deprecating it, so rollback stays a pointer flip rather than a redeploy.

4. Rollback triggers — what should auto-trip

Each trigger needs a threshold, a window, and a default action. Don't page a human first; flip the flag first, page second.

| Signal | Example threshold | Action |
|---|---|---|
| Mean output confidence (logprob or self-eval) | Drops >15% vs 7d baseline | Pause ramp, alert |
| Eval score on live golden set | Below floor for 2 consecutive windows | Auto-rollback |
| Output length / structure drift | JSON parse failure rate > 0.5% | Auto-rollback |
| Refusal / safety trigger rate | 2x baseline | Auto-rollback |
| Per-tenant variance | Any tenant >3σ from cohort | Exclude tenant, alert |
| p95 latency | >1.5x baseline | Pause ramp |
| Cost per request | >1.3x baseline | Pause ramp |
| User thumbs-down rate | >1.5x baseline | Alert + manual review |

The per-tenant trigger is the one most teams skip. Aggregate metrics look fine while one tenant's outputs go sideways — exactly the failure pattern that prompted this post.
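The per-tenant check is also the cheapest to prototype. A sketch of the 3σ rule from the table, assuming one quality score per tenant per window (a real check would use rolling windows and robust statistics, not a single snapshot):

```python
import statistics

def flag_outlier_tenants(scores_by_tenant: dict[str, float], z: float = 3.0) -> list[str]:
    """Return tenants whose score sits more than z sigma from the
    cohort mean — the 'any tenant >3σ from cohort' trigger."""
    vals = list(scores_by_tenant.values())
    mu = statistics.mean(vals)
    sd = statistics.pstdev(vals)
    if sd == 0:
        return []  # perfectly uniform cohort, nothing to flag
    return [t for t, v in scores_by_tenant.items() if abs(v - mu) / sd > z]
```

The action on a hit is deliberately narrow — exclude that tenant from the ramp and alert — because one bad tenant shouldn't roll back a version that is fine for everyone else.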

5. Observability: the minimum log line

For every inference, persist:

{
  request_id, tenant_id, user_id,
  model_flag_version, prompt_flag_version,
  retrieval_flag_version, behavior_flag_version,
  input_hash, input_class,
  output, output_tokens, output_structure_valid,
  latency_ms, cost_usd,
  confidence_score, self_eval_score,
  guardrail_triggered, fallback_used,
  user_feedback (nullable)
}
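Materialized as a typed record, the schema above is one JSON line per inference (the dataclass and `emit` helper are a sketch; field names follow the schema, the serialization choice is an assumption):

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferenceLog:
    request_id: str
    tenant_id: str
    user_id: str
    model_flag_version: str
    prompt_flag_version: str
    retrieval_flag_version: str
    behavior_flag_version: str
    input_hash: str
    input_class: str
    output: str
    output_tokens: int
    output_structure_valid: bool
    latency_ms: float
    cost_usd: float
    confidence_score: float
    self_eval_score: Optional[float] = None
    guardrail_triggered: bool = False
    fallback_used: bool = False
    user_feedback: Optional[str] = None  # nullable until feedback arrives

def emit(log: InferenceLog) -> str:
    # One JSON line per inference; ship to your data plane of choice.
    return json.dumps(dataclasses.asdict(log))
```

The four `*_flag_version` fields are what make the queries below possible — without them you cannot slice quality by flag version.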

Three queries you should be able to run in under a minute:

  1. Output quality by flag version, sliced by tenant, last 24h.
  2. Drift between current and previous version on the same input_hash (replay set).
  3. Tenants whose distribution shifted >X% after a flag change.

If your current stack can't answer these, your flag system is theater. Tools worth looking at: LaunchDarkly or Statsig for the flag plane, OpenTelemetry + Clickhouse or Honeycomb for the data plane, Langfuse or Arize for AI-specific traces and evals.

6. The rollback unit problem

"Roll back the model" is ambiguous. A real rollback needs to revert the tuple:

(model_version, prompt_version, retrieval_index_version, behavior_config)
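One way to make that concrete is a release registry where every deploy appends the full tuple and rollback is a pointer move to a previous known-good tuple — never a partial edit of a live config. The class below is a hypothetical sketch, not a real tool:

```python
# Hypothetical release registry: deploys append the full rollback tuple;
# rollback moves the active pointer, so all four axes revert together.
class ReleaseRegistry:
    def __init__(self):
        self._history: list[tuple[str, str, str, str]] = []
        self._active = -1

    def deploy(self, model: str, prompt: str, index: str, behavior: str) -> None:
        self._history.append((model, prompt, index, behavior))
        self._active = len(self._history) - 1

    def rollback(self, steps: int = 1) -> None:
        self._active = max(0, self._active - steps)

    @property
    def active(self) -> tuple[str, str, str, str]:
        return self._history[self._active]
```

Because history is append-only, "what was serving at 14:02 yesterday" is always answerable — which is exactly what the incident review needs.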

7. Honest tradeoffs

What this approach is bad at:

  - Iteration speed — every axis you gate is an axis you must observe, so a prototype ships slower than a raw model swap.
  - Low-volume features — the per-tenant statistics need traffic to be meaningful; for internal tools, alert-only monitoring is usually enough.
  - Replacing offline evals — these triggers catch regressions in production, after some requests have already been served.

Teams building serious AI surfaces — credit decisioning at Cashpo, drug interaction checks at HealthPotli — converge on this pattern not because it's elegant but because the cost of a silent regression in a regulated workflow is much higher than the engineering overhead. If you're designing this from scratch, our AI studio team has notes.

Frequently Asked Questions

Can I use LaunchDarkly or Statsig for AI feature flags?

Yes, as the control plane. They handle targeting, percentages, and the flag API well. What they don't do is evaluate output quality or trigger rollbacks from model signals — you wire that yourself by having your inference service read flags from them and report metrics to a separate observability stack (Langfuse, Arize, or homegrown on Clickhouse).
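The shape of that wiring is small. In the sketch below, `StubFlagClient` stands in for the LaunchDarkly/Statsig SDK and `StubMetricSink` for Langfuse/Arize/Clickhouse — neither is a real SDK API, they just show the control-plane/data-plane split:

```python
# Control plane: decides which version serves (stands in for a flag SDK).
class StubFlagClient:
    def __init__(self, assignments: dict):
        self.assignments = assignments
    def variation(self, key: str, tenant_id: str, default: str) -> str:
        return self.assignments.get((key, tenant_id), default)

# Data plane: receives quality signals (stands in for the observability stack).
class StubMetricSink:
    def __init__(self):
        self.rows = []
    def record(self, name: str, value: float, tags: dict) -> None:
        self.rows.append((name, value, tags))

def serve(flags, metrics, infer, tenant_id: str, prompt: str) -> str:
    model = flags.variation("model-version", tenant_id, "model-v1")
    output, quality = infer(model, prompt)  # inference returns a self-eval too
    metrics.record("self_eval_score", quality,
                   {"tenant": tenant_id, "model": model})
    return output
```

The key property: every metric row is tagged with the flag decision that produced it, so the observability stack can attribute quality to versions without the flag tool ever seeing an output.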

What's the difference between a canary deployment for AI and for regular code?

Regular canary watches error rates and latency — both loud signals. An AI canary has to watch silent signals: output distribution drift, confidence drops, structural validity, and per-tenant variance. A model can return 200 OK with perfectly formed garbage, so HTTP-level health checks are insufficient.

How do I roll back a fine-tuned model without losing recent training data?

Treat model artifacts as immutable, versioned objects. The previous fine-tune stays addressable; rollback is a pointer change, not a retraining. Training data accumulated since the bad version is preserved separately and can be replayed against whichever base you re-fine-tune from next.

Should every AI feature have an auto-rollback trigger?

Every production AI feature should have at least one — structural validity (does the output parse?) is the cheap minimum. Quality-score-based auto-rollback is worth the effort for high-traffic or high-stakes features. For internal tools or low-volume features, alert-only is often enough.
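That cheap minimum fits in a few lines: count JSON parse failures over a sliding window and trip past the threshold from the trigger table (0.5%); the window size here is an assumption.

```python
import json
from collections import deque

class ParseFailureTrigger:
    """Minimal structural-validity trigger: trips when the parse
    failure rate over a sliding window exceeds max_rate."""
    def __init__(self, window: int = 1000, max_rate: float = 0.005):
        self._results = deque(maxlen=window)
        self.max_rate = max_rate

    def observe(self, raw_output: str) -> bool:
        """Record one output; return True if rollback should trip."""
        try:
            json.loads(raw_output)
            self._results.append(True)
        except json.JSONDecodeError:
            self._results.append(False)
        failures = self._results.count(False)
        return failures / len(self._results) > self.max_rate
```

Wire `observe` into the inference path and have a trip flip the model flag back before paging anyone — flag first, page second, per section 4.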

How long should we run shadow mode before promoting a new model?

Long enough to cover your input diversity — usually a full business cycle (one week minimum for most B2B SaaS, longer if you have weekly or monthly batch patterns). For specifics on your traffic profile and risk tolerance, contact CodeNicely for a personalized assessment.

Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.