How Vahak Onboarded 800K Trucks Without Breaking Its AI
For: A Series B marketplace founder or CTO who has an AI-matching feature that works well at their current supply volume but is visibly degrading as they run an aggressive supply-side onboarding push — and cannot tell whether the model needs retraining or whether onboarding velocity itself is the architectural problem
Every Series B marketplace founder we've worked with has hit the same wall: the AI matching feature that wowed investors at 50K active carriers starts behaving strangely at 300K, and by 600K the ops team is quietly routing high-value loads manually because they don't trust the model anymore. The instinct is to retrain. The instinct is wrong.
This is a walkthrough of how we worked through that exact problem on Vahak, India's largest transport marketplace, as the platform scaled past 800,000 onboarded trucks. The lessons generalize to any two-sided marketplace running an aggressive supply push while depending on a behavioral ML model.
The setup: a matching model tuned on warm supply
Vahak connects load owners (manufacturers, traders, transporters) with truck owners across India. The core product surface is a matching engine: a load is posted, the system identifies which carriers are most likely to accept, complete, and price the trip well, and ranks them.
The original matching model was trained on what you'd expect — historical bid behavior, route familiarity, on-time completion rates, price acceptance bands, response latency, cancellation patterns. For a carrier with 40+ completed trips on the platform, the feature vector was rich. The model performed well. Load owners got fast confirmations. Cycle time dropped.
Then the growth team did its job. The onboarding funnel got aggressive — vernacular onboarding, regional field teams, referral incentives for fleet owners. New carriers started arriving in waves. Tens of thousands per week. This is where the model quietly broke.
What the team noticed first
The symptoms didn't look like a model problem. They looked like an ops problem:
- Load owners complained that suggested carriers weren't responding
- Average bids-per-load went up, but acceptance rate per bid went down
- Ops noticed that high-value lanes were getting matched to carriers who'd never actually run that route
- The internal dashboard showing "model confidence" was still high — the model was confident, just wrong
The first hypothesis, naturally, was that the model needed retraining on more recent data. So we retrained. Match quality improved for two weeks, then degraded again. Then we retrained with feature engineering tweaks — adding recency decay, regional priors, more aggressive handling of nulls. Same pattern: short-term lift, return to degradation.
The frustration was real. The team had instrumented the model well. The features were clean. The retraining cadence was disciplined. And it still wasn't holding.
The diagnostic that flipped the framing
The breakthrough came from a simple cut of the data. We split match quality by carrier tenure on the platform — how many days since onboarding, and how many completed trips. Three buckets: cold (0 trips), warming (1–10 trips), warm (more than 10 trips).
The warm bucket was performing fine. Match quality there was stable, even slightly improving with the retraining work. The warming bucket was noisy but acceptable. The cold bucket was a disaster — and because cold carriers were now a huge share of the population due to onboarding velocity, they were dragging the aggregate metric down and, worse, polluting the ranking output for warm loads too.
The model, when scoring a load, would surface a cold carrier in a top-5 position because the model had nothing to penalize them with. No completion history meant no negative signal. The carrier looked "average" because the imputed defaults for missing features sat in the middle of the distribution. Average, in a ranking model, is dangerous — it floats to the top whenever the truly strong candidates aren't a perfect fit.
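The "average floats to the top" effect is easy to reproduce with a toy ranker. The weights and feature names below are hypothetical, not Vahak's model; the point is only the mechanism: mean-imputed nulls produce a mid-range score that can outrank a carrier the model actually knows something about.

```python
# A toy linear ranker (hypothetical weights and features, not Vahak's
# model). Missing behavioral features are imputed with population means.
POP_MEANS = {"on_time_rate": 0.80, "route_familiarity": 0.50, "accept_rate": 0.40}
WEIGHTS = {"on_time_rate": 0.5, "route_familiarity": 0.3, "accept_rate": 0.2}

def score(carrier):
    # Nulls fall back to the population mean -- the imputation that
    # makes an unknown carrier look "average".
    return sum(WEIGHTS[f] * carrier.get(f, POP_MEANS[f]) for f in WEIGHTS)

warm_poor_fit = {"on_time_rate": 0.9, "route_familiarity": 0.1, "accept_rate": 0.2}
cold_unknown = {}  # zero behavioral history: every feature imputed

print(round(score(cold_unknown), 2))   # 0.63 -- pure imputation, no evidence
print(round(score(warm_poor_fit), 2))  # 0.52 -- real evidence, poor route fit
```

The cold carrier outranks the known-but-mismatched warm carrier despite the model having no evidence about it at all, which is exactly the top-5 pollution described above.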
So the real defect wasn't the model. The model was doing what it was trained to do. The defect was that we'd built a single serving path that treated cold and warm supply as the same population, when behaviorally they aren't even the same species.
The architectural call
The fix was structural, not algorithmic. We split the matching pipeline into two distinct serving paths:
Path A: Warm matching (behavioral)
This is the original model, but with a hard eligibility gate. A carrier only enters the warm matching pool after crossing a minimum behavioral threshold — completed trips, response history, verified route experience. The model only ranks within this pool. The features it was trained on actually exist for every carrier it scores. Confidence scores mean something.
Path B: Cold routing (rules + exploration)
New carriers don't get scored by the behavioral model at all. They flow through a separate path that uses declared attributes (truck type, registered base location, owner-stated route preferences, KYC tier) and a controlled exploration policy. The cold path's job is not to predict the best match — it's to generate behavioral data as fast as possible, safely, on lower-stakes loads.
The two paths feed a single ranked output to the load owner, but the blend is governed by load value, urgency, and route criticality. A high-value, time-sensitive load on a known lane sees almost entirely warm carriers. A low-value flexible load gets a deliberate injection of cold carriers — that's where the platform pays a small expected-quality cost to buy behavioral signal on new supply.
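As a rough sketch of the split, the routing could look like the following. The graduation threshold, field names, and cold-slot counts are illustrative assumptions, not the production values; the real blend is governed by more signals than this.

```python
import random

MIN_TRIPS = 10  # illustrative graduation threshold, not the production value

def is_warm(carrier):
    # Hard eligibility gate: only carriers with real behavioral history
    # enter the pool the trained model is allowed to rank.
    return carrier["completed_trips"] >= MIN_TRIPS

def match(load, carriers, behavioral_model, rng=random):
    warm = [c for c in carriers if is_warm(c)]
    cold = [c for c in carriers if not is_warm(c)]

    # Path A: the behavioral model ranks only the warm pool.
    ranked_warm = sorted(warm, key=behavioral_model, reverse=True)

    # Path B: cold routing on declared attributes only, sketched here as
    # a truck-type filter plus random exploration.
    eligible_cold = [c for c in cold if c["truck_type"] == load["truck_type"]]
    rng.shuffle(eligible_cold)

    # Blend: high-value or urgent loads see no cold supply; low-stakes
    # loads buy behavioral signal with a small deliberate injection.
    cold_slots = 0 if load["high_value"] or load["urgent"] else 2
    return ranked_warm[: 5 - cold_slots] + eligible_cold[:cold_slots]
```

A fixed slot count is the simplest possible blend rule; the post's point is that this knob exists as explicit routing logic outside the model, so it can be tuned per load segment without retraining anything.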
Why this works (and what it costs)
The reason this works is that it stops asking one model to do two incompatible jobs. Behavioral ranking and cold-start exploration are different problems. Behavioral ranking optimizes a known objective over a known distribution. Cold-start exploration deliberately accepts short-term suboptimality in exchange for information gain on a new sub-population.
Stuffing both into a single model means you either over-penalize new carriers (and starve them of the trips they need to become warm) or under-penalize them (and pollute the ranking with confident-but-uninformed predictions). Neither is recoverable through retraining. You can't retrain your way out of a structural mismatch between what the model is being asked to do and what its features support.
What it costs, honestly:
- Operational complexity. Two paths mean two sets of monitoring, two sets of failure modes, two release cycles. The team's on-call surface grew.
- Cold path tuning is a real ongoing job. The exploration policy needs to balance carrier activation against load-owner experience. Too aggressive and load owners lose trust. Too conservative and new carriers never accumulate enough trips to graduate. This is a knob, not a fix-and-forget.
- The blend logic is product-sensitive. How aggressively you inject cold supply on a given load depends on load characteristics, load owner tolerance, and lane density. We've changed the blend rules multiple times based on cohort feedback.
What moved
After the split was deployed and the cold path's exploration policy stabilized:
- Aggregate match quality stopped its slow slide and recovered to pre-onboarding-push levels
- Time-to-first-trip for newly onboarded carriers improved meaningfully, because the cold path was actively routing load opportunities to them instead of waiting for the behavioral model to surface them by accident
- Load owner trust in the suggested ranking returned — measured by bid-acceptance latency and manual override rate from ops
- The platform could continue aggressive onboarding without each new wave of carriers degrading the experience for existing ones
We're deliberately not putting precise percentages on these here. The directional point is what matters: the model didn't need to be smarter. The system around it needed to stop pretending cold and warm supply were the same thing.
The generalizable lessons
If you're running a marketplace with an AI matching feature and you're in the middle of a supply-side onboarding push, here's what to take from this.
1. Scaling supply is a data-sparsity event, not a data-abundance event
This is the counterintuitive part. Every new carrier you onboard is a new row of mostly-nulls in your feature store. Your aggregate volume goes up, but your behavioral coverage per entity goes down. If your model was tuned on a population that had behavioral history, you are now serving a population that mostly doesn't. The model isn't degrading because it's stale. It's degrading because the input distribution has shifted underneath it.
2. Confidence scores can be confidently wrong
A ranking model with imputed defaults for missing features will produce stable-looking confidence numbers on entities it knows nothing about. Your monitoring won't catch this unless you slice by data completeness, not just by output metrics. Always cut model performance by feature-coverage cohort. Always.
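One minimal way to implement that slice, sketched with hypothetical field names (a list of match records, each carrying the carrier's feature values at match time and a binary outcome):

```python
from collections import defaultdict

def metric_by_coverage(matches, feature_names):
    """Slice a match-quality metric by feature-coverage cohort rather
    than in aggregate. Each match dict carries the carrier's feature
    values (None = missing) and an `accepted` outcome flag; the field
    names are illustrative."""
    buckets = defaultdict(list)
    for m in matches:
        present = sum(m["features"].get(f) is not None for f in feature_names)
        coverage = present / len(feature_names)
        # Three coarse cohorts by how much of the feature vector exists.
        bucket = "full" if coverage == 1 else "partial" if coverage >= 0.5 else "sparse"
        buckets[bucket].append(m["accepted"])
    return {b: sum(v) / len(v) for b, v in buckets.items()}
```

If the `sparse` cohort's acceptance rate is far below `full` while the model reports similar confidence on both, the confidence numbers are exactly the confidently-wrong kind described above.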
3. Cold-start is an architectural concern, not a feature engineering concern
You cannot solve cold-start by adding three more features or by smarter imputation. Those help at the margins. The real fix is to acknowledge that cold entities and warm entities live in different feature spaces and need different serving logic. This is true in lending (where Cashpo faces the same shape of problem with thin-file borrowers), in healthcare AI, in recommender systems, and in any marketplace.
4. Onboarding pipeline and model serving must be designed as separate systems
If your growth team can change onboarding velocity without your ML team knowing, you have an architectural bug. The rate at which new entities arrive directly shapes the cold/warm population ratio, which directly shapes model performance. The two need a contract — either a feedback loop, or rate-aware routing, or both.
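The simplest version of that contract is a monitored invariant: the serving side publishes the cold/warm ratio it was tuned for, and onboarding velocity alarms when the live population drifts past it. The threshold below is an assumed placeholder, not a recommended value.

```python
COLD_SHARE_ALERT = 0.35  # assumed placeholder threshold, tune per platform

def onboarding_contract_check(active_carriers):
    """A minimal onboarding/ML 'contract': flag when onboarding velocity
    pushes the cold share of the active population past what the serving
    blend was tuned for. Carriers with zero completed trips count as cold."""
    cold = sum(1 for c in active_carriers if c["completed_trips"] == 0)
    share = cold / len(active_carriers)
    return {"cold_share": share, "breach": share > COLD_SHARE_ALERT}
```

Whether a breach throttles onboarding, widens the cold path's exploration budget, or just pages the ML team is a product decision; the point is that the ratio is observed by both teams instead of shifting silently.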
5. Retraining is the wrong first move
When a marketplace model degrades during growth, the team's instinct is always to retrain. Retraining helps when the world has changed underneath a stable population. It doesn't help when the population itself has shifted. Diagnose which one you're in before spending a sprint on retraining.
How to know which problem you actually have
A quick diagnostic for the founder or CTO reading this:
- Pull your last 90 days of matching outputs. For each match, record carrier tenure (days since onboarding) and completed-trip count at the time of the match.
- Bucket into cold (0 trips), warming (1–10), warm (more than 10).
- Plot your primary match-quality metric per bucket, per week.
- If the warm bucket is stable and the cold bucket is bad and growing as a share of total — you have the problem in this post. Split your serving path.
- If all three buckets are degrading roughly together — you actually do have a model staleness problem. Retrain.
- If warm is degrading and cold is fine — your warm population's behavior is shifting (new lane mix, seasonality, demand-side change). Different fix again.
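The bucketing step above fits in a few lines. The field names are illustrative; adapt them to whatever your match log actually records.

```python
from collections import defaultdict

def tenure_bucket(completed_trips):
    # Cold (0 trips), warming (1-10), warm (more than 10).
    if completed_trips == 0:
        return "cold"
    return "warming" if completed_trips <= 10 else "warm"

def weekly_quality_by_bucket(matches):
    """matches: iterable of dicts with `week`, `completed_trips_at_match`,
    and a 0/1 `quality` outcome (e.g. bid accepted). Returns the mean
    quality per (week, tenure bucket) cell, ready to plot."""
    cells = defaultdict(list)
    for m in matches:
        key = (m["week"], tenure_bucket(m["completed_trips_at_match"]))
        cells[key].append(m["quality"])
    return {k: sum(v) / len(v) for k, v in cells.items()}
```

Plot the three bucket series per week and read off which of the three cases above you are in.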
The diagnostic takes a day. The wrong fix takes a quarter.
How CodeNicely can help
The Vahak engagement is directly relevant if you're a Series B or growth-stage marketplace where AI matching is core to the product and where supply onboarding is being pushed hard. We worked alongside Vahak's team through exactly the diagnostic-to-architecture-to-deployment arc described above — including the unglamorous parts: the monitoring redesign, the cold-path exploration policy tuning, the blend rules, the on-call playbooks for when the two paths disagree.
What we tend to bring to engagements like this is the willingness to challenge the framing before writing code. Most marketplace teams have strong ML engineers who can retrain a model. Fewer have the architectural pattern library that comes from having shipped this specific problem before across healthcare, fintech SaaS, lending, and logistics. If you want to talk through where your matching system is, our AI studio team handles these diagnostics regularly for scaleups.
Frequently Asked Questions
How do I know if my marketplace AI is suffering from cold-start pollution versus model staleness?
Slice your match-quality metric by entity tenure. If new entities (zero or near-zero behavioral history) are performing badly and growing as a share of your population while older entities perform fine, you have a cold-start architecture problem. If all tenure cohorts are degrading together, you likely have a staleness or distribution-shift problem that retraining can address.
Can't I solve this with better feature imputation instead of splitting the serving path?
Imputation helps marginally but doesn't fix the structural issue. A ranking model can't distinguish between "this entity is genuinely average" and "this entity is unknown and I'm imputing average values." Both produce mid-range scores that float to the top of the ranking when stronger candidates aren't a clean fit. The fix is to stop scoring unknown entities with a model that was trained on known entities.
What's the right ratio of cold to warm supply to surface in matching results?
There's no universal number. It depends on load value, load owner tolerance for variance, and how quickly your platform needs to graduate new supply to warm status. Most marketplaces tune this per load segment — high-value loads see almost no cold supply, low-stakes loads see a controlled injection. This is an ongoing tuning job, not a one-time setting.
Do I need to rebuild my entire matching system to implement this split?
Usually not. The behavioral model often stays largely intact — you're adding an eligibility gate in front of it and building a parallel cold-routing path alongside it. The bigger lift is typically the monitoring redesign and the blend logic that combines the two paths' outputs. For a personalized assessment of what this would look like on your stack, contact CodeNicely.
How does this apply to marketplaces beyond logistics?
The pattern generalizes wherever you have a two-sided platform with a behavioral ML model and active supply (or demand) growth. Lending platforms see it with thin-file borrowers. Recruitment marketplaces see it with newly registered candidates. Recommender systems see it with new items. The architectural principle — separate cold and warm serving paths, treat cold-start as exploration not prediction — holds across all of them.
Building something in Logistics & Supply Chain?
CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.
Talk to our team