May 8, 2026 • 13 min read

How to Run a Shadow Deployment Before Your AI Feature Goes Live

For: A senior engineer or technical product lead at a Series A SaaS company who is two weeks from shipping their first consequential AI feature — one that affects user-facing decisions — and knows that staging tests passed but does not trust that staging reflects production traffic well enough to ship blind

You're two weeks from shipping. Staging is green. Your eval set scores look fine. And you still don't want to push the button — because you've been around long enough to know that staging traffic is a sanitized fiction and the first hour of real users will surface inputs your test fixtures never imagined. A feature flag is not the answer. A feature flag protects users from seeing a bad output; it cannot tell you the output was bad. That's what shadow deployment is for.

This playbook is for a specific situation: you have a model or LLM-backed feature that affects user-facing decisions — a recommendation, a classification, a generated response, a risk score — and you want to run it against real production traffic before any user sees its output. If that's you, here's how to do it without burying your team in tooling work.

What shadow deployment actually means (and what it doesn't)

Shadow deployment, sometimes called dark launching an ML model or shadow mode testing, is the practice of running your new model in parallel with whatever currently serves the request — including "nothing" — capturing its outputs, and discarding them before they reach the user. Production inputs flow into the new model. The new model's outputs flow into a logging sink. Users see the existing behavior.

What it gives you: real input distributions, real latency under real load, real failure modes, real cost-per-request numbers, and a side-by-side log of what your model would have done.

What it does not give you: a measure of user response. Nobody clicked anything. Nobody accepted or rejected the suggestion. You're testing the model's behavior, not the feature's reception. That comes later, in a canary or A/B test. Don't conflate them.

One more honest tradeoff: shadow mode roughly doubles your inference cost and traffic for the duration of the test. For a tiny model that's nothing. For a GPT-4-class call on every request, it's real money. Plan for it.

The playbook

Step 1. Decide what "agreement" means before you write a line of shadow code

The most common failure I see: teams stand up shadow infra, capture a million predictions, then sit in a Slack thread for a week arguing about whether the model is "doing well." You have to define the comparison before you collect data.

Pick a small number of measurable comparisons. For a classifier replacing a rules engine: agreement rate, per-class precision/recall against a labeled subset, distribution of confidence scores. For an LLM generating customer-facing text: a rubric (factuality, tone, format compliance) scored by a judge model on a sampled subset, plus deterministic checks (JSON parses, length bounds, refusal rate). For a recommender: top-k overlap with the current system, plus catalog coverage.

Write down the threshold that means "ship" and the threshold that means "go back." If you can't articulate it now, you won't articulate it under deadline pressure either.

Anti-pattern: "We'll just look at the logs and see." You will look at the logs, see something weird, rationalize it, and ship.

You'll know this step is done when you have a one-page doc with: the metrics, the sampling strategy, the labeling source (human, judge LLM, or existing ground truth), and numeric pass/fail thresholds.
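To make that concrete, here is a minimal sketch of what the pass/fail check might look like for the classifier case. It assumes scikit-learn and a small human-labeled subset; the thresholds and variable names are placeholders for whatever you wrote in the one-page doc, not prescribed values.

```python
# Minimal sketch of a numeric "ship / go back" check for a classifier replacing
# a rules engine. Thresholds and names are illustrative, not recommendations.
from sklearn.metrics import classification_report

# Written down BEFORE shadow starts: what "ship" and "go back" mean numerically.
THRESHOLDS = {
    "agreement_rate_min": 0.85,  # shadow vs. current system, all mirrored traffic
    "precision_min": 0.90,       # per-class, on the human-labeled subset
    "recall_min": 0.80,
}

def evaluate_shadow(production_labels, shadow_labels, true_subset, shadow_subset):
    """Compare shadow output against production and against ground truth."""
    agreement = sum(p == s for p, s in zip(production_labels, shadow_labels)) / len(shadow_labels)
    report = classification_report(true_subset, shadow_subset, output_dict=True)

    # Worst per-class numbers, skipping the aggregate entries sklearn adds.
    skip = ("accuracy", "macro avg", "weighted avg")
    worst_precision = min(v["precision"] for k, v in report.items() if k not in skip)
    worst_recall = min(v["recall"] for k, v in report.items() if k not in skip)

    ship = (
        agreement >= THRESHOLDS["agreement_rate_min"]
        and worst_precision >= THRESHOLDS["precision_min"]
        and worst_recall >= THRESHOLDS["recall_min"]
    )
    return {"agreement_rate": agreement, "worst_precision": worst_precision,
            "worst_recall": worst_recall, "ship": ship}
```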

Step 2. Mirror traffic asynchronously — never synchronously

Two architectures to choose between, and one is almost always wrong.

Synchronous mirror (avoid): your request handler calls both the production path and the shadow path, waits for both, and returns the production response. This is simple to build, but it adds shadow latency to every user request. If your shadow model has a bad day, your users have a bad day. Don't do this.

Asynchronous mirror (do this): your request handler returns the production response. A side-effect — a Kafka topic, a Kinesis stream, an SQS queue, or even a fire-and-forget HTTP call to an internal service — fans the request out to the shadow model. Shadow inference happens off the critical path. Failures and slowness in shadow are invisible to users.

The implementation detail that matters: capture the full input, not just the obvious fields. Include the user context, the feature flags active at request time, the model version on the production side, and a request ID that ties the shadow log back to the production log. You will need all of this when something looks wrong.
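Here is a sketch of the producer side of that pattern, assuming SQS as the transport and boto3 as the client; any queue, stream, or internal HTTP endpoint works the same way. The queue URL and field names are illustrative, not a prescribed schema.

```python
# Producer side of the async mirror: build the shadow payload and publish it
# fire-and-forget. The SQS choice, queue URL, and field names are assumptions.
import json
import time

import boto3

sqs = boto3.client("sqs")
SHADOW_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/000000000000/shadow-inference"  # placeholder

def mirror_to_shadow(request_id, request_payload, user_context, active_flags, prod_model_version):
    """Publish the full request context for the shadow worker to replay later.

    Call this off the critical path (background task, post-response hook) so a
    slow or failing queue never touches user-facing latency.
    """
    message = {
        "request_id": request_id,                  # joins the shadow log back to the production log
        "received_at": time.time(),
        "input": request_payload,                  # the full input, not just the obvious fields
        "user_context": user_context,
        "feature_flags": active_flags,
        "production_model_version": prod_model_version,
    }
    try:
        sqs.send_message(QueueUrl=SHADOW_QUEUE_URL, MessageBody=json.dumps(message))
    except Exception:
        # Shadow is best-effort: swallow the error, optionally count it in a metric.
        pass
```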

Anti-pattern: sampling to 1% on day one to "keep costs down," then realizing your tail behaviors only show up at 50%+ traffic.

You'll know this step is done when you can show a graph of shadow QPS tracking production QPS within 1% over a 24-hour window, and your p99 production latency is unchanged.

Step 3. Log the right things, in a queryable place

Shadow data is worthless if it lives in S3 as gzipped JSON nobody opens. You need three things in a warehouse or analytics store (Snowflake, BigQuery, ClickHouse, even Postgres for low volumes): the shadow record (the full input, the context you captured in Step 2, and the shadow output), the production-side output for the same request, and the request ID and metadata that join the two.

For LLM features specifically, log the full prompt and full completion. Yes, it's a lot of bytes. You will regret not having them the first time output quality drops and you can't reproduce the bad case.

You'll know this step is done when a non-author on your team can answer the question "show me the 50 worst shadow disagreements from yesterday" with a single SQL query.
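If a warehouse query is the acceptance test, something like the following is its shape, here wrapped in Python with made-up table and column names; adjust the ORDER BY to whatever "worst" meant in your Step 1 doc.

```python
# The "50 worst shadow disagreements from yesterday" query as a Python string.
# shadow_predictions / production_predictions and every column name are
# assumptions; swap in your own schema.
WORST_DISAGREEMENTS_SQL = """
SELECT
    s.request_id,
    p.output       AS production_output,
    s.output       AS shadow_output,
    s.confidence   AS shadow_confidence,
    s.latency_ms   AS shadow_latency_ms
FROM shadow_predictions s
JOIN production_predictions p USING (request_id)
WHERE s.created_at >= CURRENT_DATE - INTERVAL '1 day'
  AND s.output <> p.output
ORDER BY s.confidence DESC   -- here "worst" = shadow is most confident while disagreeing
LIMIT 50;
"""
```

Any warehouse client that speaks SQL can run it; the point is that a teammate can answer the question without asking the model's author which table to look in.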

Step 4. Run for long enough to see the weird stuff

The shortest useful shadow window is a full business cycle for your product. For most B2B SaaS, that's at least one full week — Monday morning traffic looks nothing like Saturday night, and a Tuesday-only test will lie to you. For consumer apps with weekly seasonality, the same. For products with monthly billing cycles or end-of-month workflows, longer.

During this window, watch for latency drift at peak traffic, per-request cost creeping past what you budgeted, how the disagreement rate trends day over day, and failure modes your eval set never produced: timeouts, malformed outputs, inputs far longer than any fixture, batch jobs that arrive all at once.

Anti-pattern: running shadow for 48 hours, declaring victory, shipping, then discovering on Monday that 30% of your B2B customers send batch requests at 9am that blow up your prompt budget.

You'll know this step is done when you've seen at least one full weekly cycle, p99 latency is stable, and the disagreement rate has plateaued; if it's still falling, more shadow time will still teach you something.
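A plateau is easy to eyeball and easier to argue about. A tiny check like the one below removes the argument; it assumes you can pull a daily disagreement rate from the warehouse, and the numbers are illustrative.

```python
# Sketch of a "has the disagreement rate plateaued?" check over daily rates
# pulled from the warehouse. Window and tolerance are illustrative defaults.
def disagreement_has_plateaued(daily_rates, window=3, tolerance=0.01):
    """True if the last `window` daily rates moved less than `tolerance` total.

    daily_rates: list of floats, oldest first.
    """
    if len(daily_rates) < window + 1:
        return False  # not enough data to call it a plateau
    recent = daily_rates[-window:]
    return max(recent) - min(recent) < tolerance

# Still falling -> keep shadowing; flat for three days -> time to triage.
print(disagreement_has_plateaued([0.41, 0.32, 0.21, 0.14]))              # False
print(disagreement_has_plateaued([0.41, 0.21, 0.14, 0.13, 0.13, 0.13]))  # True
```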

Step 5. Triage disagreements with a real labeling pass

You will end the shadow window with a disagreement list. Some of these disagreements are your new model being wrong. Some are your old model (or your old rules) being wrong, and the new model is actually an improvement. You cannot tell which is which without labeling.

Pull a stratified sample: ~100 disagreements, ~50 agreements as control, oversample edge cases (low confidence, long inputs, rare classes). Get them labeled by someone who knows the domain — a support engineer, a product manager, a domain expert, not a random Mechanical Turk worker unless your task is genuinely generic.

Then split the disagreements into four buckets:

  1. New model right, old model wrong — wins. Count these.
  2. New model wrong, old model right — losses. These are the ones to study.
  3. Both right, different valid answers — common in generative tasks. Decide whether the diversity is acceptable.
  4. Both wrong — your evaluation set was lying to you. Add these to it.

If wins clearly outnumber losses and the losses cluster around fixable patterns, you have a path forward. If losses are random or include any catastrophic ones (PII leakage, harmful content, regulatory issues), you don't ship — you fix and re-shadow.
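A small sketch of the four-bucket tally, assuming each labeled case records whether the new and old outputs were judged correct; the field names are placeholders for whatever your labeling tool emits.

```python
# Four-bucket triage over the labeled sample. Field names are illustrative.
from collections import Counter

def triage(labeled_cases):
    """labeled_cases: iterable of dicts like
    {"request_id": "...", "new_correct": True, "old_correct": False}"""
    buckets = Counter()
    for case in labeled_cases:
        new_ok, old_ok = case["new_correct"], case["old_correct"]
        if new_ok and not old_ok:
            buckets["win"] += 1          # new model right, old wrong: count these
        elif old_ok and not new_ok:
            buckets["loss"] += 1         # study these before shipping
        elif new_ok and old_ok:
            buckets["both_right"] += 1   # different but valid answers
        else:
            buckets["both_wrong"] += 1   # feed these back into the eval set
    return dict(buckets)
```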

You'll know this step is done when every loss-bucket case has either a fix in your backlog or a documented decision that the failure mode is acceptable.

Step 6. Define the canary plan before you turn shadow off

Shadow tells you the model behaves reasonably on real inputs. It cannot tell you how users will respond. The next step is a canary — 1%, then 5%, then 25% of users actually seeing the output — gated on live metrics.

Before you flip the canary on, write down: which metrics tell you to roll back, what the rollback procedure is (a feature flag toggle, ideally one a single operator can flip without a deploy), and who is on call to watch the dashboards for the first 24 hours. Keep the shadow logging running through canary so you can compare "what canary users saw" with "what the model would have output for everyone."
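If you want the rollback to be more than a promise, a guardrail roughly like this can run on a schedule during the canary window. The thresholds are illustrative, and the flag-disable call stands in for whatever your flag platform actually exposes.

```python
# Sketch of an automated canary guardrail. Thresholds, metric names, and the
# flag name are illustrative; disable_flag wraps your flag platform's API.
ROLLBACK_RULES = {
    "error_rate_max": 0.02,        # 2% of canary requests erroring
    "p99_latency_ms_max": 1500,
    "refusal_rate_max": 0.05,      # for LLM features
}

def should_roll_back(live_metrics):
    return (
        live_metrics["error_rate"] > ROLLBACK_RULES["error_rate_max"]
        or live_metrics["p99_latency_ms"] > ROLLBACK_RULES["p99_latency_ms_max"]
        or live_metrics.get("refusal_rate", 0) > ROLLBACK_RULES["refusal_rate_max"]
    )

def guardrail_loop(fetch_live_metrics, disable_flag):
    """Called on a schedule during the canary window."""
    metrics = fetch_live_metrics()
    if should_roll_back(metrics):
        disable_flag("ai_feature_canary")  # one toggle, no deploy
        # ...then page whoever is on call; the runbook takes it from here.
```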

You'll know this step is done when someone other than the model author could roll the feature back at 3am with a runbook.

Step 7. Keep shadow running after launch — for the next version

The instinct after a successful launch is to tear down the shadow infrastructure. Don't. Repurpose it. The next model version, the next prompt change, the next provider switch (GPT-4 to Claude, OpenAI to a self-hosted Llama variant) all deserve the same treatment. Teams that keep shadow as a permanent piece of their ML platform ship faster on every subsequent change because the cost of validation is amortized.

If you're using a feature platform (LaunchDarkly, Statsig, Unleash) or an experimentation platform (Eppo, Optimizely), most have shadow or holdout primitives you can lean on instead of rebuilding. If you're rolling your own, the asynchronous-fanout pattern from Step 2 is the only piece you really need; everything else is logging and analysis.

Failure modes I've seen

The "staging was lying" failure. Team builds a great eval harness on staging data, ships behind a flag, turns it on, immediately gets paged. Real production inputs included multi-language content, attachments, malformed Unicode, and prompts 10x longer than any staging fixture. Shadow would have caught all of it in the first hour.

The synchronous mirror tax. Team mirrors traffic synchronously "just for a few days." Their LLM provider has a slow afternoon. p99 latency on the main product doubles. Customers notice. The shadow gets ripped out and the team becomes shadow-skeptical for a year. Always async.

The disagreement avalanche. Team turns on shadow, sees 40% disagreement with the existing rules engine, panics, assumes the model is broken. Turns out the rules engine had been wrong about 30% of the time and nobody had ever measured. Without labeled ground truth, a disagreement rate is just a number.

The cost surprise. Team forgets that shadow doubles inference traffic. Burns through the month's API budget in a week. Either budget for it explicitly or sample (10–25% of traffic is usually enough for statistical confidence on aggregate metrics; you only need full traffic if you're hunting rare events).

The "we'll review the logs later" failure. Shadow runs for two weeks, captures gigabytes of data, nobody looks at it because everyone is busy with the launch. Schedule the review. Put it on the calendar before you start shadow.

How CodeNicely can help

Most of our AI engagements involve features where being wrong has consequences — drug interaction warnings, credit decisions, financial reconciliations. The closest match to a Series A SaaS team shipping their first consequential AI feature is the work we did on Cashpo, where the credit scoring model affected real lending decisions and we couldn't ship blind. We ran the new model in shadow against the production rules-based system for several weeks, labeled disagreements with the credit team, and only promoted to canary once the win/loss ratio and the failure-mode distribution were both acceptable. Most of the engineering work was not the model — it was the logging, the labeling tooling, and the comparison dashboards. That's the part teams underestimate.

If you're standing up shadow infrastructure for the first time and want a second pair of eyes on the comparison metrics, the sampling strategy, or the warehouse schema, our AI studio team has done this enough times to skip the obvious mistakes. Talk to us for a personalized assessment of your specific feature and traffic profile.

Frequently Asked Questions

How is shadow deployment different from a feature flag or canary release?

A feature flag controls who sees the output. A canary release exposes a small percentage of real users to the new behavior. Shadow deployment exposes zero users — it runs the new model on real production inputs and discards the outputs after logging them. You typically use them in sequence: shadow first to validate the model behaves sanely, then canary to validate users respond well.

How long should I run shadow mode before shipping?

At minimum one full business cycle for your product, which for most B2B SaaS is a full week including a weekend. Long enough to see traffic seasonality, batch jobs, and end-of-period workflows. Stop when your disagreement rate has plateaued and you've seen no novel failure modes for several days running.

Does shadow deployment work for LLM features, or only for traditional ML models?

It works for both, with adjustments. For LLMs the comparison is harder because outputs are generative — there's no single "right" answer. You typically combine deterministic checks (format compliance, refusal rate, length, JSON validity) with a rubric-based judge model scoring a sampled subset, plus human review of edge cases. The infrastructure pattern (async fanout, full logging, warehouse-backed analysis) is identical.

What if my AI feature has no existing production behavior to compare against?

This is common for net-new features. You don't compare against an existing system; you compare against your acceptance criteria. Define what good output looks like (accuracy thresholds, format requirements, latency bounds), run the model on real production inputs, sample outputs for human or judge review, and gate the launch on whether the absolute quality clears your bar. The shadow infrastructure still earns its keep — it surfaces real input distribution, latency tails, and cost projections you can't get from staging.

How much traffic do I need to mirror — 100% or a sample?

Depends on what you're hunting. For aggregate metrics (agreement rate, latency percentiles, cost projections) a 10–25% sample is usually enough. For rare-event detection (catastrophic failures that occur in 0.1% of inputs) you want closer to 100%. Start at a sample, watch your cost burn, and increase if your tail metrics are noisy.
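One way to keep the sample honest is to make it deterministic, so the same request IDs stay in the mirror as you raise the rate. A sketch, with an illustrative 25% default:

```python
# Deterministic, hash-based sampling for the mirror: stable per request,
# independent of process or host. The default rate is illustrative.
import hashlib

def in_shadow_sample(request_id: str, sample_rate: float = 0.25) -> bool:
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map the hash to [0, 1)
    return bucket < sample_rate
```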


The thesis is simple: don't ship an AI feature whose behavior you've only verified on staging traffic. Real inputs are weirder, your latency tails are worse, and your eval set has gaps you don't know about yet. Shadow deployment is the cheapest insurance you can buy against the kind of launch incident that costs you a customer or a quarter. Build it once, keep it running, ship with confidence on every model change after.

Building something in SaaS?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team