
5 Mistakes We Made Shipping AI to a Live Transport Marketplace

For: a Series B logistics marketplace founder whose engineering team just shipped an AI matching or routing feature and is now seeing silent failures — low acceptance rates, carrier churn, route suggestions that look right in staging but get rejected on the ground — and who cannot tell whether the problem is the model, the data pipeline, or a marketplace dynamic the model was never taught.

Your routing model passed offline evaluation at 87% top-1 accuracy. It's been live for six weeks. Carrier acceptance is stuck at 22%. Empty legs haven't moved. Engineering thinks it's the model. Ops thinks it's the carriers. Product is rebuilding the feature flag. Nobody is right, and nobody is wrong — because in a two-sided transport marketplace, AI failure modes don't look like AI failures. They look like silence.

We've shipped matching and routing systems on live freight platforms and watched the same five mistakes recur. None of them show up in your logs. All of them show up in your unit economics three months later. Here's the post-mortem, written for the founder who's reading carrier churn dashboards at 11pm trying to figure out what broke.

Mistake 1: Treating low acceptance rate as a model accuracy problem

This is the most expensive mistake on the list, and almost every team makes it.

Your model predicts the best carrier for a load. The carrier rejects it. Your training pipeline logs this as a negative example. You retrain. The model gets "better" by your offline metric. Acceptance rate stays flat or drops.

What actually happened: the carrier rejected the load because their truck was already mid-route, their driver had timed out on HOS, the rate was below their floor that week, or they were holding capacity for a regular shipper. The match was correct. The market wasn't.

Symptom in production: acceptance rate plateaus despite model improvements. Top suggestions get rejected at similar rates to mid-tier suggestions. Your model becomes more confident on the wrong signal.

Root cause: you trained a matching model on a label ("rejected") that conflates match quality with carrier availability. The model has no feature for liquidity. It cannot see that the "best" carrier had no truck within 200km that hour.

How to recover: separate your label space. A rejection is not a negative — it's one of at least four classes: (a) carrier unavailable, (b) rate mismatch, (c) lane mismatch, (d) genuine bad match. Instrument the rejection reason at the carrier app level, even if it's a single-tap dropdown. Train only on class (d). Use classes (a)–(c) as features for a separate availability model that gates the matcher's output.
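A minimal sketch of that label separation, assuming rejections arrive from the carrier app as (load, carrier, reason) events and the matcher trains on simple triples; the field names and event shape are illustrative, not a prescribed schema:

```python
from enum import Enum

class RejectionReason(Enum):
    CARRIER_UNAVAILABLE = "carrier_unavailable"  # (a) truck mid-route, driver out of hours
    RATE_MISMATCH = "rate_mismatch"              # (b) offer below the carrier's floor
    LANE_MISMATCH = "lane_mismatch"              # (c) wrong corridor, no return leg
    BAD_MATCH = "bad_match"                      # (d) the only true negative for the matcher

def build_matcher_labels(events):
    """Keep only events where 'rejected' really means 'bad match'.

    `events` is assumed to be a list of dicts like
    {"load_id": ..., "carrier_id": ..., "accepted": bool, "rejection_reason": str | None}
    coming from the single-tap dropdown described above.
    """
    labeled = []
    for e in events:
        if e["accepted"]:
            labeled.append((e["load_id"], e["carrier_id"], 1))
        elif e["rejection_reason"] == RejectionReason.BAD_MATCH.value:
            labeled.append((e["load_id"], e["carrier_id"], 0))
        # Classes (a)-(c) never become matcher negatives; they feed the separate
        # availability model that gates the matcher's output.
    return labeled
```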

The hard truth: you may discover your model is actually fine and your liquidity is the problem. That's a commercial fix, not a model fix.

Mistake 2: Building routing AI on stale location data and not knowing it

Routing models assume the world they see in the feature store is the world the carrier is operating in. It almost never is.

The carrier app pings location every 5 minutes. Your routing engine pulls the last known position when it generates suggestions. The matching cron runs every 2 minutes. Between the ping interval and the cron lag, your model is routinely deciding on a position that is up to seven minutes old, in a domain where a truck can move 10km in that window.

Worse: a meaningful percentage of your carriers have background location disabled, are in low-connectivity zones, or kill the app to save battery. Their "last known" is from 4 hours ago, and your model is treating it as current.

Symptom in production: suggestions that look perfect in staging (where you replay with backfilled data) get rejected on the ground with "too far" as the reason. Acceptance rates are higher for carriers in dense urban corridors than for long-haul carriers — because urban carriers ping more often.

Root cause: your feature pipeline doesn't carry data freshness as a feature. The model treats a 30-second-old GPS reading the same as a 90-minute-old one.

How to recover: add location_age_seconds as an explicit feature. Add a confidence band — anything older than N minutes gets a wider radius. Don't suggest carriers whose location confidence has decayed past a threshold. And measure freshness by carrier segment — your long-haul cohort almost certainly has worse data than you assume.
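A minimal sketch of the freshness gate, assuming you have a last-ping timestamp per carrier; the half-life, radius, and cut-off below are placeholders you'd tune per carrier segment:

```python
import time

MAX_LOCATION_AGE_S = 45 * 60   # older than this: exclude the carrier entirely
BASE_RADIUS_KM = 50            # search radius when the ping is fresh
HALF_LIFE_S = 10 * 60          # confidence halves every 10 minutes

def location_features(last_ping_ts: float, now: float | None = None):
    """Turn a raw 'last known position' timestamp into explicit freshness features.

    Returns None when confidence has decayed past the threshold, meaning the
    carrier should drop out of suggestions rather than be matched on
    hours-old coordinates.
    """
    if now is None:
        now = time.time()
    age_s = max(0.0, now - last_ping_ts)
    if age_s > MAX_LOCATION_AGE_S:
        return None
    confidence = 0.5 ** (age_s / HALF_LIFE_S)  # exponential decay with ping age
    return {
        "location_age_seconds": age_s,                               # explicit model feature
        "location_confidence": confidence,
        "search_radius_km": BASE_RADIUS_KM / max(confidence, 0.1),   # widen as confidence decays
    }
```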

Mistake 3: Optimizing for the wrong side of the marketplace

Most freight AI teams build matching from the shipper's perspective: minimize cost, minimize time, maximize on-time rate. The carrier is treated as a resource to allocate.

Carriers don't behave like resources. They behave like small businesses with cashflow, fuel, return-leg planning, and relationship preferences. A carrier who accepts a load that ends in a low-volume district at 9pm is stuck with a deadhead return. They learn this in two trips. After that, your model's "optimal" suggestion gets ignored — silently. They don't reject; they just don't tap.

Symptom in production: notifications are read but not acted on. Specific lane combinations show high view rates and low acceptance. Carrier churn concentrates in cohorts that took several "optimal" loads early in their lifecycle.

Root cause: the objective function only contains shipper-side variables. You're maximizing short-term shipper value at the cost of long-term carrier retention, which collapses liquidity, which destroys shipper value.

How to recover: add a return-leg probability score for every suggested load. Add a carrier-lifetime-value adjustment to the ranking. Track not just acceptance but the carrier's next 3 loads after each accepted match. If accepting your suggestion correlates with reduced activity over the next 14 days, your model is eating its own liquidity.
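One way to fold the carrier side into the ranking, sketched under assumptions: the weights, the return-leg probability model, and the LTV factor are illustrative and would be fit against your own 14-day activity data, not copied as-is:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    carrier_id: str
    shipper_score: float       # the existing shipper-side objective (cost, ETA, on-time)
    return_leg_prob: float     # P(carrier finds a paying load back from this drop-off)
    carrier_ltv_factor: float  # e.g. predicted retained activity, normalised around 1.0

# Assumed weights; tune against carrier retention, not just acceptance.
W_SHIPPER, W_RETURN, W_LTV = 0.6, 0.25, 0.15

def two_sided_score(c: Candidate) -> float:
    """Rank carriers on marketplace value, not shipper value alone."""
    return (W_SHIPPER * c.shipper_score
            + W_RETURN * c.return_leg_prob
            + W_LTV * c.carrier_ltv_factor)

def rank(candidates: list[Candidate]) -> list[Candidate]:
    return sorted(candidates, key=two_sided_score, reverse=True)
```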

Mistake 4: Shipping without a counterfactual logging layer

You can't improve what you can't measure, and most matching systems are unmeasurable by construction.

When your model picks carrier A and the load gets accepted, you log success. You never find out what would have happened if you'd picked B, C, or D. Your training data is permanently biased toward whatever your current policy favors. Over time, the model collapses onto a narrow set of carriers — the ones it picked first — and stops exploring.

Symptom in production: a small group of carriers takes a disproportionate share of loads. New carrier acceptance rates are terrible. The model "works" but the marketplace concentrates dangerously.

Root cause: no exploration policy, no counterfactual logs, no off-policy evaluation framework. You shipped a greedy ranker and called it AI.

How to recover: log the top-K candidates and their scores for every matching decision, not just the chosen one. Inject a small epsilon of randomized exploration (5–10% is usually enough) so you build a dataset where the model's preferences and outcomes can be decoupled. Run inverse-propensity-weighted evaluation when you retrain. This is non-negotiable for any serious real-time AI transport platform — without it, every model update is a guess.
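A sketch of what that logging and evaluation loop can look like, assuming a plain epsilon-greedy policy over an already-scored candidate list; the record shape and JSONL sink are illustrative:

```python
import json
import random

EPSILON = 0.05  # 5% exploration; tune against how fast off-policy evaluation converges

def choose_and_log(candidates, log_file):
    """Pick a carrier and log the full slate so the decision stays evaluable later.

    `candidates` is assumed to be a list of (carrier_id, model_score) pairs,
    sorted by score descending.
    """
    if random.random() < EPSILON:
        chosen_idx = random.randrange(len(candidates))   # uniform exploration
    else:
        chosen_idx = 0                                   # greedy exploitation
    # Probability the logging policy lands on this index, needed for IPS later.
    propensity = EPSILON / len(candidates) + ((1 - EPSILON) if chosen_idx == 0 else 0.0)

    record = {
        "slate": [{"carrier_id": c, "score": s} for c, s in candidates],  # top-K, not just the winner
        "chosen": candidates[chosen_idx][0],
        "propensity": propensity,
    }
    log_file.write(json.dumps(record) + "\n")
    return candidates[chosen_idx][0]

def ips_estimate(logged, new_policy_prob):
    """Inverse-propensity-scored value of a candidate policy on old logs.

    Each record in `logged` needs a 0/1 'reward' (accepted or not), its logged
    'propensity', and enough context for `new_policy_prob(record)` to return
    the probability the new policy would have made the same choice.
    """
    total = sum(r["reward"] * new_policy_prob(r) / r["propensity"] for r in logged)
    return total / len(logged)
```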

Mistake 5: No human-in-the-loop for edge cases the model wasn't trained on

Freight is full of edge cases that don't show up in 95% of your data: temperature-sensitive cargo, oversized loads, regulated goods, holiday closures, regional strikes, weather reroutes. Your model sees these so rarely it treats them as noise.

Then a shipper posts a perishable load on a Friday before a long weekend, your model suggests a carrier with a 6-hour return window, and the entire shipment fails. The shipper churns. The carrier blames the platform. You blame the model. The model did exactly what the data told it to.

Symptom in production: rare but catastrophic failures. Specific shipper complaints that don't correlate with any model metric. Ops manually overriding suggestions for certain load types — often without telling engineering.

Root cause: no confidence threshold below which the model defers to a human. No flagging system for out-of-distribution loads. The model is treated as binary on/off rather than as one input into a dispatcher's decision.

How to recover: calibrate your model's confidence (Platt scaling or isotonic regression works for most cases). Below a threshold, route the suggestion to a human dispatcher with the model's top-3 candidates as a recommendation. Track override rates by load attributes — your dispatchers' overrides are the highest-quality labeled data you'll ever get. Feed them back into training.
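A sketch of the calibration and deferral gate, using scikit-learn's isotonic regression on held-out suggestion outcomes; the 0.55 threshold and the record fields are assumptions you'd set from your own override and failure data:

```python
from sklearn.isotonic import IsotonicRegression

DEFER_THRESHOLD = 0.55  # placeholder; set it where failures start costing more than dispatcher time

def fit_calibrator(raw_scores, outcomes):
    """Map raw matcher scores to calibrated acceptance probabilities.

    `raw_scores` are model scores on a held-out slice; `outcomes` are the
    observed 0/1 acceptances for those same suggestions.
    """
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(raw_scores, outcomes)
    return iso

def route_suggestion(calibrator, candidates):
    """Auto-send confident matches; defer the rest to a dispatcher with a top-3 shortlist."""
    top = sorted(candidates, key=lambda c: c["score"], reverse=True)
    p = calibrator.predict([top[0]["score"]])[0]
    if p >= DEFER_THRESHOLD:
        return {"action": "auto_suggest", "carrier_id": top[0]["carrier_id"]}
    # Dispatcher overrides here become the highest-quality labels you'll ever get.
    return {"action": "dispatcher_review", "shortlist": top[:3]}
```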

The diagnostic framework: is it the model, the data, or the marketplace?

If you take one thing from this post, take this triage:

1. Segment acceptance rate by lane, carrier cohort, and time of day. Uniformly low everywhere points at the model. Concentrated in specific lanes or time windows points at liquidity, not the model.
2. Check rejection reasons. If most are availability or rate, the match was right and the market wasn't. That's a commercial fix, not a retrain.
3. Check location freshness by carrier segment. If "too far" rejections cluster in cohorts with stale pings, it's the data pipeline.
4. Check what happens after acceptance. If carriers who take your "optimal" loads go quiet over the next two weeks, the objective function is the problem.
5. Only when acceptance is uniformly low, the data is fresh, and rejections point at genuine match quality is it actually the model. Then retrain.

Most teams skip this triage and just retrain. Retraining when the real problem is liquidity makes the liquidity problem worse, because the model gets more confident about the wrong thing.

How CodeNicely can help

We built the matching, route optimization, and carrier-side intelligence for Vahak, one of India's largest trucking marketplaces. The work that mattered wasn't the model — it was the surrounding system: rejection-reason instrumentation, freshness-aware features, counterfactual logging, and a dispatcher override loop that fed back into training. If your team is shipping route matching AI in production and seeing the patterns above, that engagement maps closely to your problem space.

If you're earlier in the cycle — model not yet live, or live but without proper telemetry — our AI Studio team works specifically on instrumenting marketplace AI so you can tell the difference between a model failure and a market condition. We don't take engagements where the answer is "throw a bigger model at it." Most freight AI problems are systems problems.

Frequently Asked Questions

Why does our freight matching model perform worse in production than in offline evaluation?

Offline evaluation assumes the labels in your training set reflect match quality. In a live marketplace, rejection labels are heavily contaminated by carrier availability, rate disputes, and behavioral patterns the model can't see. The model looks good offline because it's learning to predict your historical policy, not real-world acceptance. Separate your rejection reasons and re-evaluate.

How do we know if our problem is model quality or marketplace liquidity?

Segment your acceptance rate by lane, carrier cohort, and time of day. If the rate is uniformly low, suspect the model. If it's concentrated in specific lanes or time windows, you have a liquidity problem — the model is being asked to match where supply doesn't exist. Fixing the model won't help; you need supply-side incentives or routing across time windows.
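A minimal sketch of that segmentation, assuming suggestion outcomes land in a flat table with lane, carrier_cohort, hour_of_day, and accepted (0/1) columns; the file name and bucket widths are illustrative:

```python
import pandas as pd

df = pd.read_csv("suggestion_outcomes.csv")

# Acceptance rate and volume per lane / cohort / 4-hour window.
hour_bucket = pd.cut(df["hour_of_day"], bins=range(0, 25, 4), right=False)
acceptance = (
    df.groupby(["lane", "carrier_cohort", hour_bucket])["accepted"]
      .agg(["mean", "count"])
      .rename(columns={"mean": "acceptance_rate", "count": "n"})
)

# Uniformly low across segments: look at the model.
# Low only in specific lanes or windows: liquidity, not the model.
print(acceptance.sort_values("acceptance_rate").head(20))
```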

Should we use a foundation model or build a custom matching model for freight?

For matching and routing specifically, custom models almost always win because the features that matter (location freshness, return-leg probability, carrier reliability, lane-level pricing) are proprietary and not in any foundation model's training data. Foundation models are useful for adjacent problems — load description parsing, document extraction, support automation — but not for the core matching engine.

How much exploration randomness is safe to inject into a live matching system?

Most teams find 5–10% epsilon-greedy exploration is the sweet spot. Below 3%, you don't generate enough counterfactual data to evaluate policy changes. Above 15%, carriers and shippers start noticing inconsistent suggestion quality. Start at 5%, log everything, and tune based on how quickly your off-policy evaluation converges.

What does it take to fix a matching system that's already underperforming in production?

It depends on which of the five mistakes is dominant and how clean your existing instrumentation is. Some fixes (adding location freshness as a feature, calibrating confidence thresholds) are straightforward; others (instrumenting rejection reasons across the carrier app, building counterfactual logging) require coordinated product and engineering work. Contact CodeNicely for a personalized assessment of your stack.

Building something in Logistics & Supply Chain?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team