How KarroFin Scaled AI Credit Scoring Without Killing Approval Rates
For: A Series A fintech founder whose AI credit scoring model passed underwriting review and hit 250K users but is now watching approval rates quietly compress — and cannot tell if the model is getting more conservative, the applicant pool is shifting, or the feature pipeline has drifted
KarroFin's credit model wasn't broken. There were no alerts, no failed batches, no AUC collapse on the monitoring dashboards. But approval rates had drifted down roughly 9 percentage points over four months, and the growth team was furious. The data science team insisted the model was performing within tolerance. Both were right. That's the problem this post is about.
What follows is an honest walkthrough of how we diagnosed it, what we tried first that didn't work, and the architectural decision that pulled approval rates back up without raising default risk. If you're running an AI credit scoring model in production and watching approval rates compress at scale, the framing here will save you a quarter spent debugging the wrong thing.
The setup: a model that passed every test it was given
KarroFin launched its credit scoring model after a clean underwriting review. The architecture was reasonable: gradient-boosted trees on top of bureau pulls, device signals, employment metadata, and behavioral features pulled from the onboarding flow. They calibrated the decision threshold against a labeled dataset of about 10K applicants from their pilot phase. Default rate held at target. Approval rate sat in a healthy range. Underwriters signed off.
By the time they reached 250K applicants, three things had changed and nobody had a clean way to attribute the impact:
- The applicant mix had shifted. Early users came from referral and a paid acquisition channel skewed toward salaried professionals in tier-1 cities. At 250K, organic was dominant, the geographic mix was wider, and a growing share of applicants were thin-file or gig-economy workers.
- Two upstream data providers had silently changed their response schemas. Not breaking changes — just additional null patterns and a few enum values that hadn't existed during training.
- A product team had added three new fields to the application form to support a separate compliance requirement. Those fields ended up in the feature pipeline as engineered inputs, even though no one had decided they should be predictive.
The model was doing exactly what it was trained to do. The world it was scoring had moved.
What the team tried first (and why it didn't work)
The instinct, when approval rates compress, is to assume model drift and retrain. KarroFin's data science team did this. They pulled six months of recent labeled data, retrained the model, and shipped a new version. Approval rates moved up about 1.5 points and then resumed compressing within three weeks.
The second instinct is to look at PSI (Population Stability Index) on input features. They did that too. Several features had PSI above 0.2, which flagged real population shift, but retraining had supposedly absorbed that. So why was approval still drifting?
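For anyone reproducing the check, PSI is simple enough to compute directly. Here's a minimal Python sketch of the standard formulation (not KarroFin's production monitoring):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline (training-era)
    distribution and a recent production window of the same feature."""
    # Equal-width bins from the baseline range; production values outside
    # that range are clipped into the edge bins.
    edges = np.linspace(np.min(expected), np.max(expected), bins + 1)
    clipped = np.clip(actual, edges[0], edges[-1])

    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(clipped, edges)[0] / len(actual)

    # Small floor avoids log(0) when a bin empties out in one window.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Conventional reading: under 0.1 is stable, 0.1 to 0.2 is moderate shift,
# above 0.2 is significant shift worth investigating.
```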
Here's what nobody had questioned: the decision threshold itself. The cutoff score that separated approve from decline had been set once, calibrated against the original 10K population, and inherited across two retrains. The model was producing well-calibrated probabilities. The threshold sitting on top of those probabilities was the artifact of a different business — a smaller, more homogeneous, more risk-averse business that needed to prove the model worked before it could grow.
This is the insight worth internalizing: the threshold that minimizes default risk at launch is almost never the threshold that maximizes business health at scale. And treating that threshold as a data science parameter, owned by the modeling team, instead of a product and risk decision owned jointly with finance and growth, is what turns a working credit scoring model into a growth ceiling.
The diagnostic framework we built
Before touching the threshold, we needed to separate the four possible causes of approval compression so we could argue about them with evidence instead of intuition. We instrumented the pipeline to attribute every percentage point of approval change to one of:
- Population shift — the same model, scoring a different mix of applicants, would naturally produce a different approval distribution. This is not a bug. This is the world changing.
- Model behavior change — the retrained model assigning meaningfully different scores to applicants whose underlying risk hadn't changed.
- Feature pipeline drift — upstream data shape changes, null-handling differences, or new features silently entering the pipeline and affecting scores.
- Threshold mismatch — the cutoff being wrong for the current applicant distribution, regardless of model quality.
We built a counterfactual scoring harness. Every applicant from the last 90 days was rescored under four conditions: the original v1 model with the original threshold, the current v3 model with the original threshold, the current v3 model with a recalibrated threshold, and the current v3 model with the original feature schema (stripping the three new product-form fields and the new enum values from upstream providers).
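A stripped-down version of that harness is below. Everything named here is a stand-in: `score_v1` and `score_v3` wrap the two model versions, `strip_new_fields` reverts a feature frame to the v1 schema, and the approve-when-below-cutoff convention is an assumption, since KarroFin's internals aren't public.

```python
from typing import Callable
import pandas as pd

# A scorer maps a feature frame to predicted default probabilities.
Scorer = Callable[[pd.DataFrame], pd.Series]

def approval_rate(pd_default: pd.Series, cutoff: float) -> float:
    # Assumed convention: approve when the predicted default
    # probability falls below the cutoff.
    return float((pd_default < cutoff).mean())

def run_counterfactuals(
    applicants: pd.DataFrame,          # last 90 days of applicants
    score_v1: Scorer,                  # hypothetical wrapper around model v1
    score_v3: Scorer,                  # hypothetical wrapper around model v3
    strip_new_fields: Callable[[pd.DataFrame], pd.DataFrame],
    orig_cutoff: float,
    recal_cutoff: float,
) -> pd.Series:
    """Rescore the same applicants under the four conditions above."""
    return pd.Series({
        "v1 model, original threshold":  approval_rate(score_v1(applicants), orig_cutoff),
        "v3 model, original threshold":  approval_rate(score_v3(applicants), orig_cutoff),
        "v3 model, recalibrated cutoff": approval_rate(score_v3(applicants), recal_cutoff),
        "v3 model, original schema":     approval_rate(
            score_v3(strip_new_fields(applicants)), orig_cutoff),
    })
```

Differencing the scenarios is what produces the attribution: v1 versus v3 at the same threshold isolates model behavior change, original versus stripped schema isolates pipeline drift, original versus recalibrated cutoff isolates threshold mismatch, and what remains against the pilot-era baseline is population shift.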
The attribution looked roughly like this:
- About 40% of the compression came from genuine population shift: riskier applicants showing up, which the model was correct to score lower.
- About 15% came from feature pipeline drift introduced by the upstream schema changes.
- About 10% came from the three new product-form fields silently becoming inputs.
- Roughly 35% came from the threshold being miscalibrated for the current population.
That last number is what mattered. More than a third of the compression was a threshold problem masquerading as a model problem. No amount of retraining would have fixed it because retraining doesn't move the cutoff — it just changes the score distribution underneath the cutoff.
The architectural call
We made three changes. None of them were glamorous. All of them generalized.
1. Decoupled the threshold from the model
The threshold became a versioned, separately owned artifact. The model team owns score calibration (probabilities should mean what they say). A risk committee — credit, finance, product — owns the threshold. Every threshold change ships through a documented decision with explicit tradeoff math: expected approval rate, expected default rate at portfolio level, expected revenue impact, and expected provisioning impact.
This sounds bureaucratic. It isn't. It's the difference between a 22-year-old data scientist quietly choosing a number that determines whether your loan book grows and a cross-functional decision with an audit trail. Lenders that scale survive this transition. Lenders that don't are the ones that get surprised by their own approval rates.
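In practice, "versioned, separately owned artifact" means something roughly this shape. The fields below are illustrative, not KarroFin's actual record format:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ThresholdDecision:
    version: str                   # e.g. "threshold-v4"
    cutoff: float                  # approve below this predicted default probability
    effective_from: date
    expected_approval_rate: float  # the tradeoff math presented to the committee
    expected_default_rate: float
    expected_revenue_impact: str
    approved_by: tuple[str, ...]   # credit, finance, and product sign-offs
    supersedes: str | None = None  # previous version, for the audit trail

# The scoring service loads the active record at decision time. The model
# itself never hard-codes a cutoff, so a threshold change ships without
# a retrain.
```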
2. Built a feature pipeline contract layer
The two upstream provider schema changes and the three product-form fields had no business affecting scores until someone explicitly approved them. We added a feature contract: every input the model consumes is declared, typed, and version-pinned. New fields entering the application form do not flow through to scoring unless a feature engineer registers them. Schema changes from upstream providers fail closed — the pipeline rejects unexpected shapes and routes the applicant to a fallback scoring path until a human approves the new schema.
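A minimal sketch of what fail-closed looks like, here using pydantic; the field names, enum values, and the `score_with_model` / `fallback_scoring_path` helpers are all illustrative:

```python
from typing import Literal
from pydantic import BaseModel, ConfigDict, ValidationError

class ScoringInputsV2(BaseModel):
    # extra="forbid": unregistered fields are rejected instead of silently
    # flowing through to the feature pipeline.
    model_config = ConfigDict(extra="forbid")

    bureau_score: int
    months_employed: int | None                         # this null pattern was explicitly approved
    device_risk_band: Literal["low", "medium", "high"]  # pinned enum: new values fail validation

def score_with_model(inputs: ScoringInputsV2) -> float:
    ...  # hypothetical: call the calibrated model on validated inputs

def fallback_scoring_path(raw: dict) -> float:
    ...  # hypothetical: conservative fallback until a human approves the new schema

def route_applicant(raw: dict) -> float:
    try:
        return score_with_model(ScoringInputsV2.model_validate(raw))
    except ValidationError:
        # Unexpected shape from a provider or the product form: fail closed.
        return fallback_scoring_path(raw)
```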
This eliminated the silent-drift class of bugs entirely. It also made every PSI alert actionable, because we now knew the input set was stable by contract.
3. Threshold recalibration as a product ritual
Instead of treating recalibration as something that happens when someone notices a problem, we made it a quarterly review. Every quarter, the team rescored a sample of recent applicants across a grid of candidate thresholds, projected default and approval impact at portfolio level, and presented options to the risk committee. The committee picks one. The decision is logged.
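The grid itself is a small amount of code once outcomes have matured. A sketch, assuming scores are calibrated default probabilities and `defaulted` holds observed outcomes on a matured sample; a production version would also project revenue and provisioning per the committee template:

```python
import numpy as np
import pandas as pd

def threshold_grid(pd_default: np.ndarray, defaulted: np.ndarray,
                   candidates: np.ndarray) -> pd.DataFrame:
    """Project approval and portfolio default rates across candidate cutoffs."""
    rows = []
    for cutoff in candidates:
        approved = pd_default < cutoff
        rows.append({
            "cutoff": round(float(cutoff), 4),
            "approval_rate": float(approved.mean()),
            # Default rate among the loans this cutoff would have booked.
            "portfolio_default_rate": (
                float(defaulted[approved].mean()) if approved.any() else float("nan")
            ),
        })
    return pd.DataFrame(rows)

# Example: sweep cutoffs from 2% to 20% predicted default probability.
# grid = threshold_grid(scores, outcomes, np.linspace(0.02, 0.20, 10))
```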
The first recalibration after we shipped this pulled approval rates back up by about 6 percentage points with a defaults impact the risk committee accepted because it was visible, modeled, and chosen — not stumbled into.
How the metrics actually moved
Within roughly two months of shipping the threshold decoupling, the feature contract layer, and the first recalibration:
- Approval rate recovered most of its compression — not by overriding the model but by aligning the cutoff with the actual applicant distribution.
- Time from "approval rate looks weird" to "we know exactly why" dropped from days of debate to hours of attribution.
- The number of silent feature pipeline incidents dropped to zero, because the contract layer made silence impossible.
- The risk committee gained the ability to make explicit risk-appetite choices instead of inheriting them from a model nobody wanted to question.
Default rate did move up slightly — that's the honest tradeoff. The point isn't that we found a free lunch. The point is that the previous threshold was leaving expected-value money on the table because nobody had the framework to argue about whether to take it.
What this approach is bad at
To be honest about the tradeoffs:
- It assumes well-calibrated probabilities. If your model outputs scores that don't correspond to actual default probabilities, threshold tuning is voodoo. Calibration has to come first: Platt scaling or isotonic regression on a holdout set, validated against actual outcomes (see the sketch after this list).
- It requires labeled outcome data. Threshold recalibration depends on knowing default rates by score band, which requires loans to mature. Lenders with very short books or very long tenor products have to be careful here.
- The feature contract layer adds friction. Product teams that used to add form fields freely now have to coordinate with the modeling team. This is correct but slow. You pay a velocity cost for the safety.
- Quarterly recalibration is too slow for some markets. If your applicant population is shifting weekly because you're entering new geographies fast, quarterly is wrong. The cadence has to match the volatility.
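On the first point, the calibration step is worth a concrete sketch. Here's isotonic regression with scikit-learn on synthetic stand-in data (substitute your real holdout scores and matured outcomes, and in practice fit on one split and check on another):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.calibration import calibration_curve

# Synthetic stand-ins: a miscalibrated score and matured outcomes.
rng = np.random.default_rng(0)
raw_scores = rng.uniform(0, 1, 5_000)
defaulted = (rng.uniform(0, 1, 5_000) < raw_scores ** 2).astype(int)

# Fit a monotone mapping from raw score to empirical default probability.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_scores, defaulted)
calibrated = iso.predict(raw_scores)

# Reliability check: bucketed predicted vs. observed default rates should
# track each other before any threshold tuning happens.
prob_true, prob_pred = calibration_curve(defaulted, calibrated, n_bins=10)
```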
The lessons that generalize
If you operate any production ML system that gates a business decision — credit, fraud, pricing, eligibility — these patterns apply:
Separate the model from the decision. The model produces a score or probability. The decision threshold is a business policy. Conflating them puts a policy choice inside an engineering artifact, where it gets lost.
Make feature pipelines contractual. Silent drift from upstream providers and silent additions from product teams are the two most common sources of "the model is acting weird and we can't tell why." Both are preventable with explicit input contracts.
Attribution before action. When a metric moves, resist the urge to retrain. Attribute the change to population, model, pipeline, or threshold first. Acting without attribution is how you ship three retrains that don't fix the problem.
Recalibration is a product ritual, not an incident response. Build it into the calendar. The discipline of forcing the risk committee to look at threshold tradeoffs every quarter — even when nothing seems wrong — surfaces drift before it becomes compression.
How CodeNicely can help
If you're running an AI credit scoring model in production and approval rates are doing things you can't fully explain, the work above is the kind of engagement we do. The KarroFin project sits alongside our work with Cashpo on lending, KYC, and AI credit scoring — both involved taking a model that passed underwriting and rebuilding the operational layer around it so the business could actually scale on top of it.
The pattern matches if any of these sound familiar: your model passed review but the team can't agree on why approvals are compressing; feature pipelines have grown organically and no one fully owns what's in them; the decision threshold was set once and hasn't been formally revisited since; or your data scientists and your risk team are arguing about a number neither of them feels fully responsible for.
We bring the diagnostic framework, the engineering to ship the contract layer and counterfactual harness, and — honestly more important — the cross-functional facilitation to move threshold ownership from the modeling team to a risk committee without breaking trust. If that fits, our AI studio is the right place to start the conversation.
Frequently Asked Questions
How do I tell if my credit model is drifting or my applicant pool has changed?
Run a counterfactual rescoring exercise. Take recent applicants and score them under both the current model and the original model. If both produce similar score distributions but approvals are still down, the population has shifted and the model is correctly assigning lower scores. If the current model produces meaningfully different scores from the original on the same applicants, that's model behavior change. PSI on input features tells you population shift; KS or PSI on output scores tells you model behavior change.
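The output-score comparison is a few lines with a two-sample KS test; the beta-distributed scores below are placeholders for rescoring the same applicants under both model versions:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholders: in practice, score the SAME recent applicants under v1 and v3.
rng = np.random.default_rng(7)
scores_v1 = rng.beta(2, 8, 10_000)
scores_v3 = rng.beta(2, 7, 10_000)

result = ks_2samp(scores_v1, scores_v3)
# A large statistic on identical inputs means model behavior changed;
# stable output scores alongside falling approvals point to population
# shift instead.
print(result.statistic, result.pvalue)
```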
Should I retrain my credit model when approval rates drop?
Not as the first move. Retraining changes the score distribution but not the threshold sitting on top of it, and in our experience a meaningful share of approval compression at scale comes from threshold mismatch rather than model staleness. Attribute the drop first — population, pipeline, model, or threshold — and act on what the attribution says.
Who should own the credit decision threshold?
Not the modeling team alone. The threshold is a business policy that determines approval rate, default rate, revenue, and provisioning. It should be owned by a risk committee that includes credit, finance, and product, with the modeling team providing the tradeoff math. Treating the threshold as an engineering artifact is how lenders end up with risk appetites nobody explicitly chose.
How often should we recalibrate a production credit scoring model?
It depends on how fast your applicant population is moving. For a lender in a stable geography with a steady acquisition mix, quarterly threshold review is reasonable. For a lender expanding fast or seeing major channel shifts, monthly may be necessary. The cadence should match the volatility of the input distribution — not a fixed industry default.
What does it cost to build a feature contract layer and counterfactual harness?
It depends on the current state of your pipeline, your data infrastructure, and how much labeled outcome data you have available. Contact CodeNicely for a personalized assessment — we'll scope it against your specific stack and risk posture rather than quote a generic number.
Building something in Fintech?
CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.
Talk to our team