June 24, 2026 • 11 min read

Batch vs. Real-Time AI Inference: Pick the Right One

For: A COO or engineering lead at a mid-size logistics or operations company who has one AI feature running in real-time and two more on the roadmap — and is watching cloud costs climb while the team argues about whether every new feature needs a live inference endpoint

Default to batch inference. Use real-time only when the prediction has to incorporate an event that occurred in the last few seconds and a decision will be made on it within the same session. Everything else (ETA forecasts, demand planning, route scoring, fraud risk on completed shipments, churn signals, driver allocation) should run on a schedule, write to a feature store or database, and be read in O(1) by your application. In our experience, that single rule cuts most logistics AI inference bills by half or more, and it usually improves accuracy because batch jobs can afford features that real-time endpoints cannot.

The reason teams get this wrong is path dependence. The first AI feature ships behind an HTTP endpoint because that is the obvious pattern, the cloud vendor's tutorial uses it, and "real-time" sounds like the safer answer. Then every subsequent feature inherits the architecture. Six months later you are paying for GPU-backed endpoints to predict things that change once a day, and the one feature that actually needs sub-second latency is sharing capacity with batch-shaped workloads.

This post gives you a decision rule you can apply per feature, and the three axes that actually matter when picking between batch and real-time AI inference.

The decision, stated crisply

For every AI feature on your roadmap, you are choosing one of three serving patterns:

  1. Pure batch — predictions are computed on a schedule (hourly, nightly, weekly), written to a table or feature store, and read by the application like any other data.
  2. Pure real-time (online) — predictions are computed on demand via an inference endpoint, typically in tens to hundreds of milliseconds, using features assembled at request time.
  3. Hybrid (precomputed + on-demand override) — a batch job produces a baseline prediction; a lightweight real-time model adjusts it using a small set of fresh signals.
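The hybrid pattern is the least familiar of the three, so here is a minimal sketch of its shape. Everything in it is hypothetical: the `FreshSignals` fields, the 15-minute incident penalty, and the hand-written adjustment rule, which stands in for whatever lightweight real-time model you would actually train.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshSignals:
    """Hypothetical request-time signals; both fields are illustrative."""
    minutes_delayed_at_last_scan: float
    active_traffic_incident: bool


def hybrid_eta_minutes(baseline_eta: float, fresh: FreshSignals) -> float:
    """Adjust a batch-computed baseline ETA with a few fresh signals.

    The rule below is a placeholder for a lightweight real-time model;
    the 15-minute incident penalty is an assumed constant.
    """
    adjustment = fresh.minutes_delayed_at_last_scan
    if fresh.active_traffic_incident:
        adjustment += 15.0
    return baseline_eta + adjustment


# Baseline ETAs written by the scheduled job, keyed by shipment id.
baseline_table = {"SHP-1": 240.0, "SHP-2": 95.0}

# Read the precomputed baseline, then apply the on-demand override.
eta = hybrid_eta_minutes(
    baseline_table["SHP-1"],
    FreshSignals(minutes_delayed_at_last_scan=10.0, active_traffic_incident=True),
)
```

The expensive work stays in the scheduled job; the request path only does a lookup plus a cheap adjustment.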

The mistake is treating this as a latency question. It is a feature freshness question, and freshness is rarely what you think it is.

The non-obvious bit: freshness vs. accuracy is a real tradeoff

Real-time inference forces you to compute features at request time. That means anything expensive (a 90-day rolling aggregate across millions of shipments, a graph embedding over your carrier network, an LLM-generated summary of recent customer interactions) is either skipped, approximated, or precomputed elsewhere. Your online latency budget is usually somewhere between 50ms and 200ms end to end, and most of that gets eaten by network, serialization, and the model itself. That leaves maybe 20-50ms for feature lookup.

Batch pipelines have no such budget. A nightly job can join twelve tables, run a gradient-boosted model with 400 features, validate the output, and write it to Postgres. The model is more accurate because it sees more signal. The serving cost at read time is a primary-key lookup.
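A rough sketch of the write-then-read shape, using Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere. The table name `sla_risk`, its columns, and the hard-coded scores are assumptions for illustration; in practice the scores would come from the model step of the batch job.

```python
import sqlite3
from datetime import datetime, timezone


def write_predictions(conn, rows):
    """Batch step: upsert (shipment_id, risk_score) pairs with a timestamp."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sla_risk ("
        "shipment_id TEXT PRIMARY KEY, risk_score REAL, computed_at TEXT)"
    )
    computed_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT OR REPLACE INTO sla_risk VALUES (?, ?, ?)",
        [(sid, score, computed_at) for sid, score in rows],
    )
    conn.commit()


def read_prediction(conn, shipment_id):
    """Serving step: a single primary-key lookup, no model call."""
    return conn.execute(
        "SELECT risk_score, computed_at FROM sla_risk WHERE shipment_id = ?",
        (shipment_id,),
    ).fetchone()  # None if the batch job has not scored this shipment


conn = sqlite3.connect(":memory:")
write_predictions(conn, [("SHP-1", 0.82), ("SHP-2", 0.07)])
score, computed_at = read_prediction(conn, "SHP-1")
```

Storing `computed_at` next to the score costs nothing and lets the application decide whether a row is still fresh enough to show.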

So the real question is not "how fast does this need to respond?" It is: does this prediction's accuracy depend on something that happened in the last few minutes? If no, batch wins on every dimension — cost, accuracy, operational simplicity, debuggability.

The three axes that actually matter

1. Decision latency tolerance (not prediction latency)

Distinguish between how fast the prediction must be served and how fresh the input data must be. A user waiting on a screen needs a response in <500ms. That response can be a row read from a table populated four hours ago. That is still "real-time serving" from the user's perspective, but it is batch inference.

Ask: if the prediction was computed at 3am and read at 2pm, is the answer still correct enough to act on? For most logistics use cases — carrier selection, lane pricing, demand forecasting, warehouse staffing, SLA risk scoring on in-flight shipments — the answer is yes.
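One way to make that question concrete is to give each feature an explicit maximum age and check stored predictions against it. The sketch below is illustrative only; the 24-hour and 15-minute tolerances are assumed values, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def is_fresh_enough(computed_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if a stored prediction is still within its tolerated age."""
    return now - computed_at <= max_age


computed_at = datetime(2026, 6, 24, 3, 0, tzinfo=timezone.utc)  # 3am batch run
read_at = datetime(2026, 6, 24, 14, 0, tzinfo=timezone.utc)     # 2pm dashboard read

# Assumed tolerances: 24 hours for lane pricing, 15 minutes for in-flight ETAs.
lane_pricing_ok = is_fresh_enough(computed_at, read_at, timedelta(hours=24))
inflight_eta_ok = is_fresh_enough(computed_at, read_at, timedelta(minutes=15))
```

Here the nightly lane-pricing row passes and the in-flight ETA row does not, which tells you the second feature needs a tighter schedule, not necessarily an endpoint.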

2. Input volatility

How quickly does the input data change in ways that materially affect the output?

Be honest here. Teams routinely classify medium-volatility problems as high because it feels more impressive. ETAs that update every fifteen minutes are almost always good enough. ETAs that need to reflect a delay that happened ninety seconds ago are rare and expensive to build correctly.

3. Prediction surface size

How many distinct predictions exist in the universe? If it is bounded and enumerable — every (origin, destination, carrier, service-level) tuple, every active customer, every SKU in every warehouse — you can precompute the whole grid. If it depends on free-form inputs that combinatorially explode (a user-typed query, an image, an arbitrary document), you cannot precompute, so real-time is forced on you.

Most logistics operational AI features have small, enumerable surfaces. There are not infinite lanes. There are not infinite carriers. There are not infinite customers. Precompute the grid nightly, refresh the rows that changed, serve from a key-value store.
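As a sketch of what "precompute the grid" means in practice, the snippet below enumerates every tuple with `itertools.product` and stores one score per key. The location codes, carrier names, and the `score_lane` function are placeholders; a real job would pull the dimensions from reference tables and call the trained model.

```python
from itertools import product

# Hypothetical dimensions; a real grid would come from reference tables.
origins = ["DEL", "BOM"]
destinations = ["BLR", "MAA", "HYD"]
carriers = ["carrier_a", "carrier_b"]
service_levels = ["standard", "express"]


def score_lane(origin, destination, carrier, service_level):
    """Placeholder for the real model; returns a deterministic dummy score."""
    return round(len(origin + destination + carrier + service_level) / 100, 2)


# Enumerate every tuple once and store the result under its key.
grid = {
    key: score_lane(*key)
    for key in product(origins, destinations, carriers, service_levels)
}

# Serving is a dictionary (or key-value store) lookup.
prediction = grid[("DEL", "BLR", "carrier_a", "express")]
```

The grid size is the product of the dimension sizes (2 x 3 x 2 x 2 = 24 here), so it is worth computing that number for your own data before deciding whether a full nightly recompute is affordable.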

Honest scoring: what each pattern is bad at

Batch inference — the downsides

Real-time inference — the downsides

Hybrid — the downsides

The decision rule

Apply this in order. Stop at the first yes.

  1. Does the prediction depend on data that is less than 60 seconds old, and will a decision be made on it within the same user session? → Real-time. Examples: fraud scoring at checkout, dynamic dispatch when a driver cancels, real-time inventory allocation.
  2. Is the prediction surface unbounded (free-form input, images, documents, queries)? → Real-time. Examples: customer-support intent classification, document extraction, search ranking.
  3. Does the input change meaningfully within the hour, but a 5-15 minute lag is acceptable? → Hybrid, or frequent batch (every 5-15 min). Examples: in-flight ETA updates, warehouse pick prioritization during a shift, surge pricing on lanes.
  4. Everything else. → Batch. Run nightly or hourly. Write to a table. Read in O(1).

In our experience, on a typical logistics AI roadmap of 8-10 features, the split lands roughly 1-2 real-time, 1-2 hybrid, and the rest batch. If your split looks more like 6 real-time and 2 batch, you are almost certainly over-engineering.
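The four checks can be written down as a small function so that every roadmap feature gets the same treatment. This is a sketch: the `Feature` field names are our own shorthand for the questions in the rule, and the three example features are illustrative classifications.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(Enum):
    REAL_TIME = "real-time"
    HYBRID = "hybrid or frequent batch"
    BATCH = "batch"


@dataclass(frozen=True)
class Feature:
    needs_data_under_60s_old: bool
    decided_in_same_session: bool
    unbounded_surface: bool
    changes_within_hour: bool
    tolerates_5_to_15_min_lag: bool


def choose_pattern(f: Feature) -> Pattern:
    """Apply the four checks in order and stop at the first match."""
    if f.needs_data_under_60s_old and f.decided_in_same_session:
        return Pattern.REAL_TIME
    if f.unbounded_surface:
        return Pattern.REAL_TIME
    if f.changes_within_hour and f.tolerates_5_to_15_min_lag:
        return Pattern.HYBRID
    return Pattern.BATCH


checkout_fraud = Feature(True, True, False, True, False)
inflight_eta = Feature(False, False, False, True, True)
demand_forecast = Feature(False, False, False, False, True)
```

Calling `choose_pattern` on these returns real-time, hybrid, and batch respectively. The value of encoding the rule is less the code than the forced yes/no answer to each question.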

Two worked examples for a logistics operator

Feature: "Predict which shipments will miss SLA in the next 24 hours"

Tempting to build as a real-time endpoint hit by the ops dashboard. Wrong choice. The relevant inputs (origin scan time, current location, carrier historical performance on this lane, weather forecast) change on a 15-30 minute cycle at most. The prediction surface is bounded — active shipments only. Run it every 15 minutes as a batch job over the active shipment table. Write a risk score and contributing factors to a column. The dashboard reads the column. Total inference cost: rounding error. Accuracy: better than real-time because you can include lane-level rolling features that would not fit in an online budget.
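A minimal sketch of one scheduled run for this feature follows. The scoring logic is a toy heuristic standing in for the real model, and the shipment fields (`lane`, `sla_deadline`, `est_hours_remaining`), the 0.3 threshold, and the 0.5 penalty are all assumptions made for the example. A scheduler such as cron would invoke `run_batch` every 15 minutes and write the result to the shipment table.

```python
from datetime import datetime, timedelta, timezone


def score_shipment(shipment, lane_late_rate, now):
    """Toy heuristic standing in for the real model (an assumption).

    Returns (risk_score, contributing_factors).
    """
    factors = []
    risk = lane_late_rate.get(shipment["lane"], 0.0)
    if risk >= 0.3:
        factors.append("lane has a high historical late rate")
    hours_left = (shipment["sla_deadline"] - now) / timedelta(hours=1)
    if hours_left < shipment["est_hours_remaining"]:
        risk += 0.5
        factors.append("estimated transit exceeds time left to SLA")
    return min(risk, 1.0), factors


def run_batch(active_shipments, lane_late_rate, now):
    """One scheduled run: score every active shipment, return rows to write."""
    return {
        s["id"]: score_shipment(s, lane_late_rate, now) for s in active_shipments
    }


now = datetime(2026, 6, 24, 12, 0, tzinfo=timezone.utc)
active = [
    {"id": "SHP-1", "lane": "A-B", "sla_deadline": now + timedelta(hours=6),
     "est_hours_remaining": 9.0},
    {"id": "SHP-2", "lane": "C-D", "sla_deadline": now + timedelta(hours=20),
     "est_hours_remaining": 4.0},
]
results = run_batch(active, {"A-B": 0.35, "C-D": 0.05}, now)
```

Returning the contributing factors alongside the score is what lets the dashboard explain a flag without calling anything at read time.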

Feature: "Route a driver around an accident reported 90 seconds ago"

This one is genuinely real-time. The input (traffic event) is seconds old, the decision (reroute now) is immediate, and the prediction surface depends on the driver's exact position. Build the endpoint. Pay the cost. Do not try to be clever with batch here.

Architecture implications

If most of your features are batch, your AI inference architecture looks less like a model-serving platform and more like a data platform with model steps. Concretely:

For the one or two genuinely real-time features, run a separate, focused serving stack — KServe, BentoML, Modal, SageMaker endpoints, whatever fits your cloud. Do not try to make it serve batch workloads too. The economics and SLOs are different.

How CodeNicely can help

We worked with Vahak, a logistics marketplace, on exactly this class of problem: moving from a one-size-fits-all inference setup toward feature-by-feature decisions on where prediction work should live. The work involved separating genuinely real-time concerns (driver-load matching at the point of search) from batch-friendly ones (lane analytics, carrier reliability scoring, demand patterns), which lowered serving cost and improved model accuracy on the batch side because we could finally afford richer features.

If you are sitting on one live inference feature and two more on the roadmap, the highest-leverage work right now is not building the next endpoint. It is auditing the first one to see if it should have been batch, and applying the decision rule above to the next two before any code is written. We do this kind of audit and replatforming as part of our digital transformation engagements. You keep the IP, no lock-in.


Frequently Asked Questions

Is batch inference the same as offline prediction?

Mostly yes — the terms are used interchangeably. Both refer to computing predictions on a schedule and storing them, rather than computing on demand. "Online vs batch prediction" is the same dichotomy as "real-time vs batch inference." The relevant distinction is when the model runs, not how the application reads the result.

Won't users complain if predictions are a few hours old?

Almost never, if you pick the right features for batch. Users complain about predictions that are wrong, not predictions that are stale. A four-hour-old prediction from a 400-feature model usually beats a fresh prediction from a 40-feature model. Show the timestamp in the UI if you want to be transparent — most users do not notice.

How do I handle cold-start entities in a batch pipeline?

Two patterns. First, a default model — a simpler, lower-feature variant that runs on demand for entities the batch job has not seen yet. Second, an incremental micro-batch that runs every few minutes and only processes new entities. Most teams start with the default-model pattern because it is simpler to operate.
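The default-model pattern can be sketched as a lookup with a fallback. The `default_model` rule and the `is_new_carrier` feature are hypothetical; the point is only the control flow of serving the stored score when it exists and a cheap on-demand score when it does not.

```python
def default_model(entity_features: dict) -> float:
    """Hypothetical low-feature fallback; a constant-plus-one-signal rule."""
    return 0.5 if entity_features.get("is_new_carrier") else 0.2


def get_prediction(entity_id, batch_table, entity_features):
    """Return (score, source): the stored batch score, or the fallback on a miss."""
    if entity_id in batch_table:
        return batch_table[entity_id], "batch"
    return default_model(entity_features), "default"


batch_table = {"carrier_a": 0.12}
known = get_prediction("carrier_a", batch_table, {})
unseen = get_prediction("carrier_z", batch_table, {"is_new_carrier": True})
```

Returning the source with the score makes it easy to track how often the fallback fires, which is a useful signal for how frequently the batch job should run.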

What about LLM-based features — those have to be real-time, right?

Only if the input is user-generated at request time. LLM features over bounded inputs (summarize this shipment's exception history, classify this support ticket's category, extract fields from this invoice) can absolutely run as batch jobs. You write the LLM output to a column, you read the column. Same rule applies: is the input free-form and the decision in-session? If not, batch.

Can you help us decide which pattern fits each feature on our roadmap?

Yes — this is a common first engagement. We audit your existing inference setup, apply the decision framework to your roadmap features, and propose a target architecture with cost and operational tradeoffs spelled out. Contact CodeNicely for a personalized assessment based on your stack and roadmap.
