How Vahak Matched 800K Trucks Without a Recommendation Collapse
For: A CTO or lead engineer at a Series B two-sided marketplace — freight, gig, or services — whose matching algorithm is degrading in quality as supply density grows, and who suspects the problem is architectural but can't pinpoint whether it's the model, the data pipeline, or the feedback loop
If you run engineering at a Series B two-sided marketplace, you've probably seen this pattern: you add more supply, your matching algorithm should get smarter, and instead the metrics that matter — fill rate, time-to-match, downstream retention — quietly degrade. The dashboards still look fine. The model's offline AUC is still strong. But the marketplace feels worse. Drivers complain. Shippers churn. Nobody can quite explain why.
This is the story of how the team behind Vahak — India's largest digital freight marketplace, with hundreds of thousands of trucks and transporters on the platform — diagnosed and rebuilt their matching architecture when exactly that happened. The non-obvious answer, which we'll get to, is that the model wasn't broken. The labels were.
The original problem
Vahak connects truck owners and fleet operators with shippers who have loads to move. On paper it's a classic two-sided marketplace: loads on one side, trucks on the other, geography and timing in the middle. In practice, freight matching is significantly harder than ride-hailing because:
- A truck isn't fungible. A 32-foot multi-axle container is not a substitute for a 19-foot open-body, even on the same lane.
- The matching window is long — hours to days, not seconds.
- Both sides negotiate. The platform's recommendation is an opening move, not a final allocation.
- Loads are directional. A truck that takes a load from Nagpur to Chennai now has a positioning problem in Chennai.
The original matching engine was, sensibly, a learning-to-rank model. Given a load, score every nearby compatible truck on a feature vector — distance, vehicle type fit, historical lane preference, owner rating, response latency, price band — and surface the top N. The training labels came from the natural feedback signal: did the recommended owner respond? Did the load close with that owner? Did the trip complete?
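To make that concrete, here is a minimal sketch of a single-model ranking pass in this spirit. The feature names and weights are illustrative assumptions, not Vahak's actual schema or model:

```python
from dataclasses import dataclass

# Hypothetical feature vector for a (load, truck) pair, loosely based on
# the features listed above. Field names are assumptions.
@dataclass
class MatchFeatures:
    distance_km: float         # deadhead distance from truck to pickup
    vehicle_type_fit: float    # 0..1 structural compatibility
    lane_preference: float     # 0..1 historical affinity for this lane
    owner_rating: float        # 0..5
    response_latency_s: float  # median historical response time
    price_band_fit: float      # 0..1 overlap with the shipper's price band

def score(f: MatchFeatures) -> float:
    # Toy linear stand-in for the learned ranker. The real system used a
    # gradient-boosted (later deep) learning-to-rank model.
    return (-0.002 * f.distance_km
            + 1.5 * f.vehicle_type_fit
            + 0.8 * f.lane_preference
            + 0.3 * f.owner_rating
            - 0.001 * f.response_latency_s
            + 0.6 * f.price_band_fit)

def top_n(candidates: list[MatchFeatures], n: int = 10) -> list[MatchFeatures]:
    # Surface the N highest-scoring compatible trucks for the load.
    return sorted(candidates, key=score, reverse=True)[:n]
```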
For the first couple of years this worked. Match rates climbed. The team shipped feature after feature. Then, somewhere past a few hundred thousand active vehicles on the platform, the curve flattened. Then it bent the wrong way. Time-to-first-response on certain lanes started rising. Shipper repeat rate on long-haul, low-volume routes weakened. Fill rate on key lanes plateaued even as supply on those lanes grew.
The team's first instinct — and this is the instinct of every senior ML engineer I know — was that the model needed retraining. So they retrained it. More features, more recent data, better hyperparameters. It barely moved.
What the team tried first
Before we get to what worked, here's what didn't, in roughly the order it was attempted. If you're staring at a degrading marketplace right now, you've probably tried some of these.
1. Bigger model, more features
The gradient-boosted ranker became a deep ranker. New features around lane seasonality, fuel price proxies, festival calendars, return-load probability. Offline NDCG improved by a meaningful amount. Online, the lift was within noise.
Lesson: when offline metrics improve and online metrics don't, your training distribution doesn't match your serving distribution. We'll come back to this.
2. More aggressive cold-start handling
The team assumed new trucks were getting buried by the ranker because they had no history. They added explore-exploit logic — a Thompson sampling layer that gave new owners a fixed allocation of impressions on each load. This helped new owners. It didn't help fill rate. The cold-start problem that freight marketplace teams obsess over turned out to be a symptom, not the disease.
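For reference, the pattern looks like this: a Beta-Bernoulli Thompson sampler over new owners that fills a reserved slice of impressions on each load. A sketch of the idea, not Vahak's implementation:

```python
import random

class NewOwnerExplorer:
    # Beta-Bernoulli Thompson sampling over new owners. Each owner's
    # response rate is modeled as Beta(alpha, beta); we sample from each
    # posterior and surface the highest draw in the reserved slot.
    def __init__(self):
        self.alpha: dict[str, int] = {}  # successes + 1
        self.beta: dict[str, int] = {}   # failures + 1

    def pick(self, new_owner_ids: list[str]) -> str:
        def draw(owner: str) -> float:
            return random.betavariate(self.alpha.get(owner, 1),
                                      self.beta.get(owner, 1))
        return max(new_owner_ids, key=draw)

    def update(self, owner: str, responded: bool) -> None:
        side = self.alpha if responded else self.beta
        side[owner] = side.get(owner, 1) + 1
```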
3. Fixing the data pipeline
There were real issues here — duplicate event firing, lane normalization quirks, vehicle-type taxonomy drift over years of free-text inputs. Cleaning this up was worthwhile and improved data quality. It did not fix the matching degradation.
4. Re-tuning the objective function
The team experimented with multi-objective optimization: balance match probability with margin, with shipper retention proxies, with geographic spread. Helpful in principle. But you can't tune your way out of a problem if the labels feeding the optimization are themselves corrupted.
The thing that was actually broken
Here is the diagnosis the team eventually arrived at, and it's the part most marketplace teams miss until they're well past it.
At marketplace scale, the dominant failure mode of a matching model is not data sparsity. It is feedback loop pollution.
Concretely, two things were happening simultaneously on the supply side:
Supply-side gaming. Experienced fleet owners had figured out, empirically, what made the algorithm surface them. Faster response times got rewarded. So some owners started auto-responding to every recommendation, then negotiating or declining off-platform. Owners who quoted aggressively low and renegotiated later were acquiring loads at a higher rate. The model, which used "responded within X minutes" and "won the load" as positive signals, was learning that these owners were the best matches. They weren't. They were the best at gaming the surface area of the matching engine.
Selection bias in accepted loads. The model was trained on loads that closed. But which loads close depends on which trucks the model surfaces, which depends on the model's prior beliefs. Over time, certain owner-lane combinations got reinforced — they were shown more, they accepted more, they got shown more. Owners who were objectively a good fit for a lane but had been under-surfaced for whatever historical reason simply disappeared from the training distribution. The model became confidently wrong about who the best match was, because it had stopped seeing counter-evidence.
Both of these problems get worse with more supply, not better. More owners means more potential gamers. More density on a lane means more opportunity for the feedback loop to lock in a suboptimal subset. This is why the curve bent the wrong way past a certain scale.
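The selection-bias half of this is easy to reproduce in a toy simulation. Everything below is invented; the point is the mechanism: a greedy ranker only updates its beliefs about owners it surfaces, so early noise hardens into permanent under-exposure.

```python
import random

random.seed(7)
N_OWNERS, TOP_K, ROUNDS = 50, 5, 2000
true_quality = [random.uniform(0.2, 0.8) for _ in range(N_OWNERS)]
shown = [2] * N_OWNERS     # neutral prior: 1 accept in 2 shows
accepted = [1] * N_OWNERS

for _ in range(ROUNDS):
    est = [accepted[i] / shown[i] for i in range(N_OWNERS)]
    # Greedy top-K by estimated accept rate, with no exploration.
    surfaced = sorted(range(N_OWNERS), key=lambda i: est[i], reverse=True)[:TOP_K]
    for i in surfaced:
        shown[i] += 1
        accepted[i] += random.random() < true_quality[i]

coverage = sum(s > 12 for s in shown) / N_OWNERS
print(f"owners with meaningful traffic after {ROUNDS} rounds: {coverage:.0%}")
# Typically a small fraction: the loop locks onto an early subset and
# never gathers counter-evidence about everyone else.
```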
The architectural call
The fix was not a better model. It was a rebuild of how labels and serving traffic interact. Three structural changes, in order of importance:
1. Separate the matching objective from the closing signal
The original system collapsed several things into one label: did this owner respond, did they win the load, did the trip complete, was the shipper happy. The team broke this into a stack of separate models, each with its own clean objective:
- Compatibility model — given a load and a truck, what is the probability this is a structurally valid match? Trained on hard constraints (vehicle type, capacity, geography, timing) plus historical compatibility, with labels that don't depend on negotiation behavior.
- Engagement model — given a compatible match, what is the probability of a genuine response? Trained with explicit handling for auto-response patterns and negotiation outcomes, not just "did they reply."
- Outcome model — given an engaged match, what is the probability of a completed, satisfactory trip? This is where shipper retention and on-time completion live.
The final ranking is a calibrated combination of the three. Crucially, gaming one stage doesn't propagate to the others. An owner who auto-responds to everything inflates their engagement score but not their outcome score, and the outcome score has the largest weight.
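The write-up only says the final rank is a calibrated combination with outcome weighted heaviest. One plausible form, with invented weights, is a chained product where the exponents act as the calibration:

```python
def final_score(p_compatible: float, p_engage: float, p_outcome: float) -> float:
    # Chained probabilities: a match must be structurally valid AND
    # genuinely engaged AND likely to complete. The exponents are
    # invented calibration weights; outcome dominates, so inflating
    # the engagement stage buys little.
    W_COMPAT, W_ENGAGE, W_OUTCOME = 1.0, 0.5, 2.0
    return (p_compatible ** W_COMPAT) * (p_engage ** W_ENGAGE) * (p_outcome ** W_OUTCOME)

solid = final_score(0.9, 0.6, 0.7)    # good all-round match
gamer = final_score(0.9, 0.99, 0.3)   # auto-responder: high engagement, weak outcomes
assert solid > gamer
```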
2. Treat training data collection as a first-class system
The team built an explicit exploration budget into the ranker. A small but non-trivial fraction of impressions on every load go to candidates the model would not otherwise have surfaced — chosen with a bandit policy that prioritizes owners the model is most uncertain about for that lane. This sounds expensive. It is. The win is that the training data the next model trains on is no longer a closed loop of the previous model's beliefs.
This is the single change senior ML teams underestimate. Your model is not just predicting outcomes; it is generating the data the next version will learn from. If you don't budget for exploration, you are guaranteeing distributional collapse. The exact size of the budget is a tradeoff — too little and you don't escape the loop, too much and current-period match quality suffers. Vahak's team tuned it per lane, based on how concentrated the historical accept distribution was.
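Here is a minimal sketch of budgeted, uncertainty-first slot allocation. The function names, the uncertainty measure, and the fixed split are all assumptions; the real system tuned the budget per lane:

```python
def allocate_impressions(ranked: list[str], uncertainty: dict[str, float],
                         slots: int, explore_frac: float) -> list[str]:
    # Fill most slots from the model's ranking, but reserve a budgeted
    # fraction for the candidates the model is most uncertain about on
    # this lane (unseen owners default to infinite uncertainty).
    # explore_frac would be tuned per lane, as described above.
    n_explore = max(1, round(slots * explore_frac))
    exploit = ranked[:slots - n_explore]
    pool = ranked[slots - n_explore:]
    explore = sorted(pool,
                     key=lambda c: uncertainty.get(c, float("inf")),
                     reverse=True)[:n_explore]
    return exploit + explore
```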
3. Counterfactual evaluation before any model ships
Offline AUC lied to the team for months. So offline AUC was demoted. The new gating metric for any candidate model is a counterfactual policy evaluation against logged exploration data — essentially, "if we had been running this policy during last month's exploration impressions, what would have happened?" Models that look better on AUC but worse on counterfactual fill rate don't ship. Models that are AUC-neutral but improve counterfactual fill rate do.
This is borrowed from the contextual bandit literature and it is, in our experience, the single most important piece of infrastructure for a mature matching architecture in logistics AI. Without it you are flying blind. With it, your offline experiments actually predict online behavior.
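The standard estimator from that literature is inverse propensity scoring over the logged exploration impressions. A self-normalized IPS sketch, assuming a log schema like the one described in the comments (not Vahak's actual schema):

```python
def snips_fill_rate(logs: list[dict], new_policy) -> float:
    # Self-normalized inverse propensity estimate of fill rate under a
    # candidate policy, computed from logged exploration impressions.
    # Assumed log schema:
    #   context: features of the load and candidate set
    #   action:  the owner actually surfaced
    #   prob:    probability the logging policy assigned to that action
    #   filled:  1 if the load completed via that owner, else 0
    # new_policy(context, action) returns the candidate policy's
    # probability of taking the same action.
    num = den = 0.0
    for rec in logs:
        w = new_policy(rec["context"], rec["action"]) / rec["prob"]
        num += w * rec["filled"]
        den += w
    return num / den if den else 0.0
```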
What it moved
We're going to stay qualitative here because the specific lift numbers belong to Vahak. What we can say:
- Time-to-first-genuine-response on previously degrading lanes reversed direction within weeks of the engagement-vs-outcome split shipping.
- Fill rate on long-haul, lower-density lanes — the ones where the feedback loop had been worst — improved meaningfully and held.
- The percentage of recommendations that resulted in completed trips, not just initial responses, became the team's North Star metric. It is harder to game and it correlates with the only thing the marketplace actually sells: empty truck miles that get turned into loaded miles.
- Shipper repeat rate on lanes where the new architecture was rolled out first improved relative to control lanes.
You can read more about the engagement on the Vahak case study page.
What this approach is bad at
Honest tradeoffs, because nobody else will tell you these:
- It is more expensive to operate. Three models instead of one. An exploration budget that costs current-period revenue. Counterfactual evaluation infrastructure. If your marketplace is pre-product-market-fit, do not do this. Ship the single ranker, get to scale, then rebuild.
- It is harder to debug. When a recommendation looks wrong, you now have to ask which of three models contributed what. Investing in per-stage interpretability is not optional.
- The exploration budget is politically hard. Business stakeholders see "we deliberately showed a worse recommendation 5% of the time" and react badly. You need a CEO or CPO who understands why this is non-negotiable. If you don't have that buy-in, the exploration budget will get cut the first time quarterly numbers wobble, and you will be back where you started within two model retraining cycles.
- Cold-start gets harder, not easier. A new owner now has to clear three model thresholds, not one. The exploration budget partly compensates, but if onboarding velocity is your top metric, this architecture creates friction.
What generalizes to other marketplaces
If you run a two-sided marketplace recommendation engine and you're seeing quality decay at scale, here is the diagnostic order we'd recommend:
- Check the gap between offline and online metrics. If offline keeps improving and online doesn't, you almost certainly have feedback loop pollution. No amount of model work fixes this.
- Audit your labels for gaming. Look at the top decile of supply by your current scoring function. Are they good, or are they good at being scored well? These are different things.
- Measure exploration coverage. What fraction of your supply-side base has received any meaningful traffic in the last 90 days? If that number is dropping, your training distribution is collapsing. (A minimal way to compute this is sketched after this list.)
- Decompose the objective before you decompose the model. A single label that mixes intent, engagement, and outcome is almost always hiding a feedback loop. Separating them often reveals which stage is actually broken.
- Invest in counterfactual evaluation before your next major rebuild. Otherwise you'll rebuild on the same broken signal.
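For the exploration-coverage check in the third item, a minimal computation looks like this; the record schema and the 20-impression threshold are illustrative:

```python
from datetime import datetime, timedelta

def exploration_coverage(impressions: list[dict], supply_ids: set[str],
                         min_impressions: int = 20, days: int = 90) -> float:
    # Fraction of the supply base with meaningful traffic in the trailing
    # window. Each impression record: {"owner_id": str, "ts": datetime}.
    # Watch the trend of this number over time, not its absolute value.
    cutoff = datetime.utcnow() - timedelta(days=days)
    counts: dict[str, int] = {}
    for imp in impressions:
        if imp["ts"] >= cutoff:
            counts[imp["owner_id"]] = counts.get(imp["owner_id"], 0) + 1
    covered = sum(1 for o in supply_ids if counts.get(o, 0) >= min_impressions)
    return covered / len(supply_ids) if supply_ids else 0.0
```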
This pattern shows up well beyond freight. It applies to gig labor matching, B2B services marketplaces, lending platforms where credit scoring feeds itself, and any recommendation system where the model's output influences the data the next model trains on. Which is to say: most of them.
How CodeNicely can help
The Vahak engagement is the most direct reference point if you're a logistics or freight marketplace seeing matching quality plateau or degrade as supply grows. The architectural patterns above — objective decomposition, exploration budgets, counterfactual evaluation — were built and tested in production against real Indian freight market dynamics, which are about as adversarial a matching environment as you'll find anywhere.
If your situation is different — say, you're earlier stage and the question is how to design the matching system from the start so you don't end up here in two years — that's a different conversation, and the AI Studio team works on that kind of greenfield architecture too. For teams in adjacent domains where feedback loops poison labels (credit decisioning, healthcare recommendations), the Cashpo work on credit scoring loops and HealthPotli work on drug interaction recommendations are closer references.
What we won't do is quote you a generic timeline or cost. The honest answer is that rebuilding a matching engine while it's running production traffic depends heavily on your current architecture, data quality, and how much exploration risk your business can absorb. Talk to us and we'll do an honest assessment.
Frequently Asked Questions
How do I know if my marketplace matching problem is the model versus the feedback loop?
Run this test: track your offline evaluation metric (AUC, NDCG, whatever you use) and your online business metric (fill rate, completion rate) on the same axis over the last six months. If offline keeps improving and online is flat or declining, the model is fine — your training labels are corrupted by feedback loop dynamics. No retraining will fix this; you need to change how training data is collected.
What is feedback loop pollution in a recommendation system?
It's the phenomenon where a model's outputs determine which interactions get observed, which become the training data for the next model, which reinforces the same outputs. Over time, the model becomes confidently wrong about parts of the supply or demand base it has stopped surfacing, because it never sees evidence that contradicts its priors. Combined with supply-side actors who learn to game the scoring function, this causes match quality to decay even as raw supply grows.
How much exploration traffic should a mature marketplace allocate?
There's no universal number. It depends on how concentrated your current accept distribution is, how fast your supply base churns, and how much current-period revenue you can sacrifice for long-term model health. Most production systems we've worked with land somewhere between 3% and 10%, often varied per segment. The right answer for your platform requires looking at your data — contact CodeNicely for a personalized assessment.
Does this approach work for ride-hailing or gig platforms, or only freight?
The architectural pattern — separating compatibility, engagement, and outcome into distinct models, plus a budgeted exploration policy and counterfactual evaluation — generalizes to most two-sided marketplaces. Ride-hailing has shorter matching windows so the engagement and outcome models look different, but the principle of separating gameable signals from outcome signals applies. Gig labor platforms benefit even more because the gaming surface is larger.
Can we add exploration to our existing ranker without rebuilding it?
Often yes, as a first step. Wrapping your current ranker with a contextual bandit layer that allocates a small fraction of impressions to under-explored candidates is a reasonable retrofit and will start improving your training data within weeks. It won't fix label corruption from gaming — that requires the objective decomposition — but it will stop your training distribution from collapsing further while you plan the larger rebuild.
Building something in Logistics & Supply Chain?
CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.
Talk to our team.