
Batch vs. Real-Time AI Inference: A Decision Framework

For: A Series A SaaS founder whose product has 3–4 distinct AI features with different latency profiles — one is a live recommendation, one is a nightly report, one is a fraud check — and they are defaulting every feature to the same real-time inference path because no one on the team has a principled framework for choosing otherwise

Most engineering teams shipping multiple AI features end up with the same problem: every model gets routed through a synchronous, real-time inference path because that's how the first feature was built and no one revisited the assumption. Then the cloud bill arrives, p99 latency on the checkout flow gets blamed on the recommendation service, and someone asks the obvious question: which of these features actually need to be real-time?

The honest answer is that the batch-vs-real-time decision is rarely about how fast your model runs. It's about whether a stale answer causes the user to take a worse action. Inference latency and decision freshness are orthogonal axes, and conflating them is what leads teams to overpay on compute while still having a sluggish product.

This piece is that framework. It defines the decision precisely, scores the axes that matter, and ends with rules you can apply to each AI feature in your product this week.

Define the decision precisely

There are really three architectural patterns, not two: real-time (synchronous) inference that runs the model while the user waits, near-real-time (async) inference that queues the request and returns the result when it's ready, and batch inference that precomputes predictions on a schedule and serves them by lookup at request time. Lumping them together is part of why teams get this wrong.

The three patterns have very different cost curves, failure modes, and engineering overhead. Picking the wrong one isn't just inefficient — it changes what your product can promise.

The five axes that actually matter

Forget the generic "latency vs cost" framing. These are the axes that determine the right answer for a specific feature.

1. Decision freshness requirement

Ask: if the prediction is 6 hours old, does the user make a worse decision? If 24 hours old? If 7 days old?

A churn score that's 24 hours stale is fine — customer behavior doesn't shift that fast, and your retention team acts on weekly cycles anyway. A fraud score that's 60 seconds stale is useless because the transaction is already authorized. A product recommendation that's 4 hours stale is usually fine on a content site and usually wrong on a flash-sale marketplace.

This axis is about the input data, not the model. If the features feeding the model don't change meaningfully within your refresh window, batch is fine regardless of how "AI-feeling" the output is.

2. Input cardinality and predictability

Can you enumerate the set of inputs in advance?

This is the axis most teams skip. A nightly report is batchable not because reports are slow but because the set of accounts to score is finite and known at 2am. If you can't enumerate the inputs before the request arrives (a free-text search query, a brand-new transaction), there is nothing to precompute and batch is off the table, however tolerant the feature is of staleness.
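
To make the enumerability point concrete, here is a minimal sketch of a nightly scoring job. `list_active_accounts`, `score_batch`, and `predictions_store` are hypothetical stand-ins for your own data access, model, and storage layers.

```python
# Minimal sketch of a nightly batch job over an enumerable input set.
# list_active_accounts, score_batch, and predictions_store are hypothetical
# stand-ins for your own data access, model, and storage layers.
from datetime import datetime, timezone

BATCH_SIZE = 256

def run_nightly_scoring(list_active_accounts, score_batch, predictions_store):
    accounts = list_active_accounts()                  # the finite, known set at 2am
    scored_at = datetime.now(timezone.utc).isoformat()
    for i in range(0, len(accounts), BATCH_SIZE):
        chunk = accounts[i:i + BATCH_SIZE]
        scores = score_batch(chunk)                    # amortize the model call over the chunk
        for account_id, score in zip(chunk, scores):
            predictions_store[account_id] = {"score": score, "scored_at": scored_at}
```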

3. Cost per inference × call volume

Batch wins on cost when you can keep GPUs warm, use larger batch sizes, and run on spot instances. The cost gap can be an order of magnitude for the same model on the same hardware.

But cost only matters at volume. If a feature gets called 500 times a day, the operational complexity of standing up a batch pipeline isn't worth the savings. If it's called 5 million times a day, batch is potentially the difference between a healthy gross margin and a bad one.
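
A back-of-envelope comparison makes the volume point concrete. Every price and throughput figure below is an illustrative assumption, not a benchmark; substitute your own numbers.

```python
# Back-of-envelope cost comparison. Every figure below is an illustrative
# assumption, not a benchmark; substitute your own prices and throughput.
calls_per_day = 5_000_000

# Real-time: provisioned GPU headroom, paid for around the clock.
realtime_gpu_hourly = 2.50            # assumed on-demand price per GPU-hour
realtime_gpus_provisioned = 2         # assumed headroom for peak traffic
realtime_daily_cost = realtime_gpu_hourly * 24 * realtime_gpus_provisioned

# Batch: warm GPUs, large batch sizes, spot pricing, running only as long as needed.
batch_gpu_hourly = 0.80               # assumed spot price per GPU-hour
batch_preds_per_gpu_hour = 250_000    # assumed throughput with large batches
batch_daily_cost = batch_gpu_hourly * (calls_per_day / batch_preds_per_gpu_hour)

print(f"real-time: ${realtime_daily_cost:,.0f}/day, batch: ${batch_daily_cost:,.0f}/day")
# With these assumptions: $120/day vs $16/day, roughly the 5-10x gap described in this piece.
```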

4. Tail latency consequences

What happens at p99? At p99.9? Real-time inference paths inherit the worst-case latency of every dependency they touch — feature store, model server, cold-start GPU, downstream API. If a feature is on a user-blocking path, a p99 of 4 seconds is a UX problem even if your median is 80ms.

Batch decouples user latency from model latency entirely. The model can take 30 seconds per prediction; the user gets the cached result in 8ms. This is often the real reason to move a feature to batch — not cost, but the freedom to use a bigger, slower, more accurate model without a latency budget.

5. Failure mode and recovery

If the model service goes down for 10 minutes, what breaks?

Batch fails soft: if the pipeline breaks, yesterday's predictions keep serving and users see slightly staler answers. A real-time path fails hard: the feature is simply down until the service recovers. For high-stakes features, batch's softer failure mode is genuinely valuable. For features where stale is worse than missing, real-time with a clear fallback is safer.
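
A minimal sketch of what "real-time with a clear fallback" can look like, assuming a hypothetical `model_client` and a cache of last-known-good scores; the right degraded behavior is a product decision, not an infrastructure one.

```python
# Minimal sketch of real-time scoring with an explicit fallback. model_client
# and fallback_cache are hypothetical; the degraded behavior is per-feature.
def score_with_fallback(model_client, fallback_cache, request, timeout_s=0.2):
    try:
        return model_client.score(request, timeout=timeout_s)
    except Exception:
        stale = fallback_cache.get(request["user_id"])
        if stale is not None:
            return stale                               # acceptable when stale beats missing
        return {"score": None, "decision": "allow"}    # or "deny"; pick the safe default deliberately
```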

Scoring the three patterns honestly

Every option has downsides. Here's the honest version.

Real-time inference

Good at: arbitrary inputs, fresh decisions, simple mental model, fast iteration on the model itself (you can ship a new version and see results immediately).

Bad at: cost efficiency at high volume, tail latency, accommodating large or slow models, surviving infrastructure hiccups gracefully. Also bad at handling traffic spikes — you need autoscaling and headroom you pay for whether or not you use it.

Near-real-time (async) inference

Good at: workloads that take 2–60 seconds (LLM generations, document processing, multi-step agents), smoothing out traffic spikes via queues, allowing larger models without blocking the UI.

Bad at: keeping the product surface simple. You now have to build status polling, webhooks, partial results, retry logic, and an empty state that explains "we're working on it." Teams underestimate this. The infrastructure is easy; the product surface is not.
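
For orientation, here is a minimal sketch of the async pattern's moving parts: submit returns a job id immediately, a worker drains a queue, and the client polls for status. `run_model` and the in-memory structures are stand-ins; a production version needs a durable queue, a durable job store, and the worker running in its own process or thread.

```python
# Minimal sketch of the async pattern: submit returns a job id immediately,
# a worker drains the queue, and the client polls for status. run_model and
# the in-memory structures are stand-ins; production needs a durable queue,
# a durable job store, and the worker running in its own process or thread.
import queue
import uuid

jobs = {}                  # job_id -> {"status": ..., "result": ...}
work_queue = queue.Queue()

def submit(payload):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    work_queue.put((job_id, payload))
    return job_id          # hand this back to the UI for polling

def worker(run_model):
    while True:
        job_id, payload = work_queue.get()
        jobs[job_id]["status"] = "running"
        try:
            jobs[job_id]["result"] = run_model(payload)   # the slow 2-60 second part
            jobs[job_id]["status"] = "done"
        except Exception:
            jobs[job_id]["status"] = "failed"             # this state needs a UI too

def poll(job_id):
    return jobs.get(job_id, {"status": "unknown"})
```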

Batch inference

Good at: cost (often 5–10x cheaper for the same model), enabling larger models, predictable capacity planning, soft failure modes, simple serving (it's just a key-value lookup at runtime).

Bad at: handling new inputs that arrived after the last batch run, cold-start for new users (they have no precomputed prediction yet), reflecting recent behavior. Also adds operational surface area: a scheduler, a pipeline, a feature store, monitoring for stale or missing predictions, backfills when something breaks.
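
At request time, serving batch predictions really is just a lookup, but the cold-start and staleness cases deserve explicit handling. A sketch, assuming the store format written by the nightly job sketch earlier in this piece:

```python
# Serving precomputed predictions: a key-value lookup with explicit handling
# for cold-start users and stale rows. Assumes the store format written by
# the nightly job sketch earlier in this piece.
from datetime import datetime, timezone, timedelta

MAX_AGE = timedelta(hours=24)

def get_prediction(predictions_store, user_id, default):
    row = predictions_store.get(user_id)
    if row is None:
        return default                       # cold start: user appeared after the last batch run
    age = datetime.now(timezone.utc) - datetime.fromisoformat(row["scored_at"])
    if age > MAX_AGE:
        return default                       # stale: the pipeline missed a run; alert on this
    return row["score"]
```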

The decision rules

Apply these in order. The first one that matches wins.

Rule 1: If the input is unbounded, you cannot batch

Free-text search queries, user-uploaded images, novel transactions — there is no input set to enumerate. Use real-time or near-real-time. Pick based on how long the model takes: under ~500ms, real-time; over that, async with a webhook or polling.

This is the case for most fraud checks, search ranking, and content moderation on user uploads.

Rule 2: If staleness causes a worse user decision, go real-time

Even with bounded inputs, some decisions can't tolerate lag. A live recommendation on an ecommerce homepage, where the user just added something to cart and you want to react to that signal: real-time. A pricing decision that depends on current inventory: real-time. A fraud check would fail this rule too, though its unbounded inputs already caught it under rule 1.

The test: imagine the prediction was made 1 hour ago. Would the user take a meaningfully worse action because of the lag? If yes, you need real-time.

Rule 3: If the model is too slow or too expensive for real-time at your volume, go batch

If you've passed rules 1 and 2, you're now in the zone where batch is viable. The question is whether it's worth the operational overhead.

Heuristics that push toward batch:

  1. The model takes more than roughly 500ms per prediction, or you want room to swap in a bigger, more accurate one.
  2. Call volume is in the millions per day, so amortized GPU time dominates the cost line.
  3. The input set is bounded and known before any request arrives.
  4. The feature tolerates predictions that are hours old.

Rule 4: For everything else, default to real-time and revisit at scale

If a feature has bounded inputs but low volume, the simplest thing is real-time. Don't build a batch pipeline you don't need. Revisit when the cost or latency actually shows up in metrics.
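
The four rules compress into a few lines. The thresholds below are starting points, not constants: the ~500ms cut-off and the one-hour staleness test come from this piece, and the 1M calls/day figure is an assumption about where cost starts to bite.

```python
# The four rules, compressed. Thresholds are starting points, not constants:
# the ~500ms cut-off and the one-hour staleness test come from this piece,
# the 1M calls/day figure is an assumption about where cost starts to bite.
def choose_pattern(inputs_enumerable: bool,
                   staleness_hurts_within_1h: bool,
                   model_latency_ms: float,
                   calls_per_day: int) -> str:
    # Rule 1: unbounded inputs cannot be precomputed.
    if not inputs_enumerable:
        return "real-time" if model_latency_ms < 500 else "near-real-time (async)"
    # Rule 2: freshness-sensitive decisions need the live path.
    if staleness_hurts_within_1h:
        return "real-time"
    # Rule 3: slow or expensive models at volume are where batch pays for itself.
    if model_latency_ms >= 500 or calls_per_day >= 1_000_000:
        return "batch"
    # Rule 4: everything else stays simple until scale forces the question.
    return "real-time (revisit at scale)"
```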

Applying the framework to a typical SaaS product

Take the three features in the brief: a live recommendation, a nightly report, a fraud check.

Live recommendation. If "live" means reacting to in-session behavior (current cart, last 3 clicks), it's real-time — staleness causes a worse decision (rule 2). If "live" really means "personalized homepage that refreshes every few hours," it's batch — same input set every night, no freshness penalty. Most teams call the second thing "real-time" out of habit and pay 10x for it. We've seen this exact pattern in marketplace and logistics products like the ones we worked on with Vahak, where some recommendations are session-driven and some are absolutely batch-friendly.

Nightly report. Bounded inputs (known accounts), no freshness pressure (the report is read once a day), high enough volume that running it on a real-time path is wasteful. Batch. This is the easy one and yet teams still run it through their inference API because that's the path that exists.

Fraud check. Real-time, no debate. Staleness is catastrophic, inputs are unbounded (every transaction is novel), and tail latency matters but is a constraint to engineer around, not avoid. The interesting question for fraud is whether some features feeding the model can be batch (e.g., a precomputed user risk score) while the final scoring is real-time. The answer is almost always yes, and it's how teams keep real-time fraud paths fast and cheap. Similar patterns show up in lending workflows like the ones in Cashpo's KYC and credit scoring stack.
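
A sketch of that split for fraud: the user risk score is looked up from a batch-precomputed store, and only the final scoring runs on the hot path. The feature names and `fraud_model` are illustrative, not a reference design.

```python
# Hybrid fraud scoring: a batch-precomputed user risk score joined with
# real-time transaction features at request time. Feature names and
# fraud_model are illustrative, not a reference design.
def score_transaction(fraud_model, risk_store, txn):
    # Precomputed nightly by a batch job; neutral default for brand-new users.
    user_risk = risk_store.get(txn["user_id"], 0.5)
    features = {
        "user_risk": user_risk,                        # the batch-side feature, cheap to look up
        "amount": txn["amount"],                       # real-time features from the transaction itself
        "merchant_category": txn["merchant_category"],
        "minutes_since_last_txn": txn["minutes_since_last_txn"],
    }
    return fraud_model.predict(features)               # only this call sits on the hot path
```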

The hybrid pattern most mature systems converge to

The endpoint for most teams isn't "all batch" or "all real-time." It's a hybrid where batch precomputes the expensive stuff and real-time does cheap composition.

A typical pattern:

  1. Nightly batch job computes user embeddings, item embeddings, and a candidate set per user.
  2. At request time, a fast real-time service takes the precomputed candidates, applies session signals (last click, current cart), and reranks with a small model.
  3. The real-time path is now doing 5ms of work instead of 500ms, on cached features instead of raw data.

This is what "good AI inference architecture" looks like in production. The batch layer carries the cost-heavy work; the real-time layer carries the freshness-sensitive work. Neither does the other's job.
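
A sketch of the request-time half of that pattern, assuming hypothetical `candidate_store` and `rerank_model` objects and a session dict carrying the live signals:

```python
# Request-time half of the hybrid pattern: precomputed candidates from the
# batch layer, reranked with session signals by a small, fast model.
# candidate_store, rerank_model, and the session fields are illustrative.
def recommend(candidate_store, rerank_model, user_id, session, k=10):
    candidates = candidate_store.get(user_id, [])
    if not candidates:
        return []                      # cold start: fall back to a popularity list in practice
    scored = [
        (item, rerank_model.score(item,
                                  last_click=session.get("last_click"),
                                  cart=session.get("cart", [])))
        for item in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:k]]
```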

What to do this week

List your AI features in a spreadsheet. For each one, fill in the five axes: freshness requirement, input cardinality, cost × volume, tail latency consequences, failure mode. Don't guess — pull the actual numbers from your observability stack.

Then apply the rules in order. You will almost certainly find at least one feature currently running real-time that should be batch, and probably one piece of a real-time feature (a slow embedding step, a heavy feature lookup) that should be precomputed. Fixing those two things is usually where the biggest wins live.

The framework isn't elegant because the underlying decision isn't elegant. It's a product question dressed up as an infrastructure question, and the right answer depends on what your users actually do with stale predictions.

Frequently Asked Questions

How do I know if my model is fast enough for real-time inference?

Measure p50, p95, and p99 latency under realistic load — not just median latency on an empty server. If p99 exceeds your user-facing latency budget (typically 200–500ms for blocking calls), you have a tail latency problem even if the median looks fine. Either optimize the model, switch to async with a loading state, or move the expensive part to batch.
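
If you want to sanity-check the numbers yourself, the percentile math is short. `call_model` below is a stand-in for your real inference call, and a sequential loop is not realistic load; in practice you would drive this with a load-testing tool or read the percentiles straight from your observability stack.

```python
# Percentile math for tail latency. call_model is a stand-in for your real
# inference call; a sequential loop is not realistic load, so treat this as
# the measurement half only and drive it with a proper load tool.
import time
from statistics import quantiles

def measure_latency_ms(call_model, requests):
    samples = []
    for req in requests:
        start = time.perf_counter()
        call_model(req)
        samples.append((time.perf_counter() - start) * 1000)
    cuts = quantiles(samples, n=100)   # 99 cut points; index i is the (i+1)th percentile
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "p99_ms": cuts[98]}
```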

Can I start with real-time and migrate to batch later?

Yes, and this is usually the right move for early-stage products. Real-time is simpler to build and lets you iterate on the model quickly. Migrate features to batch when cost or latency shows up in your metrics — not before. The migration itself is mostly about adding a feature store and a scheduler, both of which are well-trodden infrastructure.

What's the difference between async inference and batch inference?

Async inference still processes one request at a time — it just doesn't block the user. Batch inference processes many inputs together on a schedule, writes results to a store, and serves them via lookup at request time. Async is for slow models on unpredictable inputs; batch is for predictable inputs where you can amortize compute.

Does using a hosted LLM API (OpenAI, Anthropic) change this framework?

The axes are the same, but cost dynamics shift. You're not amortizing GPU time, so the cost argument for batch weakens. The freshness, input cardinality, and tail latency arguments stay intact. Many teams batch LLM calls overnight specifically to use cheaper models or avoid rate limits during the day, which is a valid reason on its own.

How should I think about inference architecture costs and timelines for my specific product?

It depends heavily on your current stack, model choices, traffic patterns, and which features qualify for batch under the rules above. Talk to CodeNicely for a personalized assessment — generic numbers will mislead you more than they help.

Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.