Questions to Ask Before Hiring an AI Credit Scoring Vendor
For: A Series A fintech founder building a digital lending product who is about to sign a contract with an AI credit scoring vendor and suspects their technical team lacks the depth to spot model quality red flags during demos
Vendor demos for AI credit scoring models are designed to look bulletproof. They show you an AUROC of 0.82 on a clean historical dataset, walk you through a slick explainability dashboard, and name-drop a few logos. Six months after you integrate, your charge-off rate creeps up and nobody can tell you why. The model didn't break — it was never tested on borrowers who looked like yours.
The hardest failure mode in alternative credit scoring isn't a model that's obviously wrong. It's a model trained on banked, salaried borrowers that quietly degrades on gig workers, first-credit applicants, or thin-file segments without throwing a single visible error. By the time the loss curve confirms it, you've underwritten a cohort you can't unwind.
If your technical bench is thin and you're a few weeks from signing, these are the questions that surface real risk. For each one: why it matters, what a credible answer sounds like, and what should make you walk.
Training data and population fit
1. What's the exact composition of your training data by income type, employment status, and credit file thickness?
Why it matters: A model trained predominantly on salaried, bureau-rich borrowers will not generalize to gig-economy or first-time borrowers. This is the single biggest source of silent distribution shift.
Good answer: They give you a breakdown — "60% salaried, 25% self-employed, 15% gig; 70% with bureau scores above 650, 20% thin-file, 10% no-file" — and ask about your applicant mix to assess overlap.
Red flag: "Our model works across all segments" or vague claims about diversity without numbers.
2. How recent is your training data and how often is the model retrained?
Why it matters: Credit behavior shifted materially through COVID, post-COVID inflation, and the recent fintech tightening cycles. A model trained on 2019 data is reading a different economy.
Good answer: Rolling retraining cadence (quarterly or better), with a documented process for monitoring the population stability index (PSI) and triggering off-cycle retraining when it breaches a threshold (a minimal PSI sketch follows below).
Red flag: Annual retraining or "the model is stable, it doesn't need frequent updates."
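PSI is simple enough to compute yourself and worth running on any vendor score feed, not just taking on faith. A minimal sketch, assuming you have a baseline sample and a current sample of scores (or a single feature); the 0.1 and 0.25 thresholds are common rules of thumb, not regulatory standards:

```python
import numpy as np

def psi(baseline, current, n_bins=10):
    """Population Stability Index between two samples of a score or feature."""
    # Bin edges come from the baseline distribution (decile cut points)
    edges = np.percentile(baseline, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values

    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)

    # Clip to avoid log(0) on empty bins
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)

    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 retrain.
```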
3. Can you run your model on a sample of our anonymized application data before we sign?
Why it matters: This is the only test that matters. Their AUROC on their data tells you nothing about performance on yours.
Good answer: Yes, with a clear data-sharing protocol, and they'll report performance broken down by your key segments rather than a single aggregate score. The sketch below shows the kind of segment-level check you can run yourself on the returned scores.
Red flag: They refuse, want a signed contract first, or only agree to score applicants without comparing predictions to actual repayment outcomes from your historical book.
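If the vendor does return scores on your historical book, the core check fits in a few lines. A sketch, assuming a pilot file with hypothetical columns vendor_score, defaulted, and segment, and that a higher vendor score means higher default risk (flip the sign if their convention is the reverse):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

pilot = pd.read_csv("vendor_pilot_scores.csv")  # hypothetical pilot export

for segment, grp in pilot.groupby("segment"):
    if grp["defaulted"].nunique() < 2:
        continue  # AUROC is undefined without both goods and bads present
    auc = roc_auc_score(grp["defaulted"], grp["vendor_score"])
    print(f"{segment}: n={len(grp)}, AUROC={auc:.3f}")
```

An aggregate AUROC that hides a near-0.5 segment shows up immediately in this breakdown.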
Model performance and validation
4. Show me your model's performance broken down by decile, segment, and vintage — not just headline AUROC.
Why it matters: An AUROC of 0.78 can hide a model that's excellent on prime borrowers and worse than random on thin-file applicants. Decile-level KS statistics and lift curves reveal where the model actually discriminates.
Good answer: They have a standard validation pack: KS by decile, Gini, calibration plots, lift charts, and out-of-time validation showing performance held up on a holdout vintage (a minimal decile table is sketched below).
Red flag: They quote one number repeatedly. Or worse, they quote accuracy, which is close to meaningless at typical single-digit default rates.
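A decile table is easy to rebuild from any scored holdout the vendor hands over. A minimal sketch, assuming score is a predicted probability of default and bad is the observed outcome:

```python
import pandas as pd

def decile_table(df, score_col="score", bad_col="bad"):
    # Rank applicants riskiest-first, then cut into ten equal buckets
    df = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    df["decile"] = pd.qcut(df.index, 10, labels=range(1, 11))

    overall_bad = df[bad_col].mean()
    out = df.groupby("decile", observed=True).agg(
        n=(bad_col, "size"),
        bad_rate=(bad_col, "mean"),
    )
    out["lift"] = out["bad_rate"] / overall_bad  # decile 1 should be >> 1
    return out
```

Bad rate should fall roughly monotonically from decile 1 to decile 10; the deciles where it doesn't are where the model stops discriminating.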
5. What's your false negative rate at the approval threshold we'd actually use?
Why it matters: Lending economics live and die at a specific cutoff. Aggregate metrics are meaningless if you're approving the top 30% of applicants — what matters is the bad rate inside that 30%.
Good answer: They simulate at multiple cutoffs and show you the approval rate vs. expected bad rate tradeoff curve (a minimal version is sketched below).
Red flag: Confusion when you ask. They've optimized for an academic metric, not a P&L.
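The tradeoff curve is worth being able to reproduce on your side. A sketch, assuming scores are predicted default probabilities and bads are realized outcomes on a scored holdout:

```python
import numpy as np
import pandas as pd

def cutoff_curve(scores, bads, approval_rates=np.arange(0.1, 1.0, 0.1)):
    df = pd.DataFrame({"score": scores, "bad": bads}).sort_values("score")
    rows = []
    for rate in approval_rates:
        approved = df.head(int(len(df) * rate))  # lowest predicted risk first
        rows.append({
            "approval_rate": round(float(rate), 1),
            "bad_rate_in_approved": approved["bad"].mean(),
        })
    return pd.DataFrame(rows)
```

Read the output against your unit economics: the approval rate where the bad rate crosses your breakeven is the cutoff that matters, not the one the vendor demoed.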
6. How does the model perform on out-of-time data versus out-of-sample?
Why it matters: Out-of-sample (random holdout) almost always looks good. Out-of-time (a later vintage the model never saw) is where overfitting and concept drift show up.
Good answer: A 10-15% degradation between in-time and out-of-time performance is normal, and a credible vendor owns it. They explain how they monitor drift in production (the two splits are sketched below).
Red flag: No degradation (suspicious — likely leakage) or they don't run out-of-time validation at all.
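The distinction between the two splits is mechanical but worth seeing. A sketch, assuming an applications file with a hypothetical app_date column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("applications.csv", parse_dates=["app_date"])

# Out-of-sample: a random holdout drawn from the same period
train_oos, test_oos = train_test_split(df, test_size=0.2, random_state=42)

# Out-of-time: train strictly before a cutoff date, test on what follows
cutoff = df["app_date"].quantile(0.8)
train_oot = df[df["app_date"] < cutoff]
test_oot = df[df["app_date"] >= cutoff]

# Expect some AUROC degradation on test_oot relative to test_oos;
# zero degradation usually means leakage, not a great model.
```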
7. What features drive the most predictive power, and what happens to performance if any one of them is unavailable for an applicant?
Why it matters: If 60% of model lift comes from bureau scores, the model is useless on no-file borrowers — regardless of how many alternative data sources are bolted on.
Good answer: They show feature importance, ablation studies, and a fallback strategy for when key features are missing (a crude ablation loop is sketched below).
Red flag: "Our model uses 2,000 features so no single one matters" — usually a deflection.
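An ablation over feature groups is a check you can run yourself once you have labeled data. A sketch with placeholder file, column, and feature names; the model choice is illustrative, not the vendor's:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("applications.csv")  # hypothetical, numeric features only
y = df.pop("bad")
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42)

feature_groups = {
    "bureau": ["bureau_score", "num_open_lines"],     # placeholder names
    "cashflow": ["avg_balance", "inflow_volatility"],
}

baseline = GradientBoostingClassifier().fit(X_train, y_train)
base_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"all features: AUROC={base_auc:.3f}")

for name, dropped in feature_groups.items():
    cols = [c for c in X_train.columns if c not in dropped]
    model = GradientBoostingClassifier().fit(X_train[cols], y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test[cols])[:, 1])
    print(f"without {name}: AUROC={auc:.3f}")  # a large drop = hard dependency
```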
Explainability, fairness, and compliance
8. Walk me through the adverse action reason codes the model generates for a declined applicant.
Why it matters: Regulators in most jurisdictions require specific, accurate reasons for declines. "The model said no" is not a defensible answer in an audit. AI lending model explainability is a compliance requirement, not a nice-to-have.
Good answer: SHAP-based or equivalent feature attribution, mapped to human-readable reason codes, with documentation showing the codes are faithful to the model's actual decision logic (the mapping pattern is sketched below).
Red flag: Generic reason codes that look the same across most declines, or post-hoc explanations from a separate model (which can disagree with the actual scoring model).
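The pattern to look for is attribution computed on the scoring model itself, then mapped to codes. A sketch using the shap library with a placeholder mapping; model and applicant_row stand in for the production scorer and a one-row feature frame, and the exact shape of shap_values varies by model type:

```python
import shap

# Hypothetical mapping from model features to human-readable reason codes
REASON_CODES = {
    "bureau_score": "Limited or adverse credit bureau history",
    "inflow_volatility": "Irregular income pattern",
    "dti_ratio": "High debt obligations relative to income",
}

explainer = shap.TreeExplainer(model)            # the actual scoring model
shap_values = explainer.shap_values(applicant_row)

# Features pushing this applicant toward decline, strongest first
# (assumes higher model output = higher default risk)
contributions = sorted(
    zip(applicant_row.columns, shap_values[0]),
    key=lambda kv: kv[1],
    reverse=True,
)
top_reasons = [REASON_CODES.get(f, f) for f, _ in contributions[:4]]
```

If the vendor's explanations come from anywhere other than the model that produced the score, ask how often the two disagree.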
9. How do you test for disparate impact across protected classes, and what do you do when you find it?
Why it matters: Even if your jurisdiction's fair-lending laws are looser than ECOA, your investors and banking partners will eventually demand this. And alternative data features are notorious for proxying demographics.
Good answer: Documented disparate impact testing, adverse impact ratios (the computation is sketched below), and a mitigation playbook (reweighting, adversarial debiasing, or feature removal).
Red flag: "We don't use protected attributes so there's no bias." That's not how proxy discrimination works.
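The adverse impact ratio itself is trivial to compute, which makes "we don't test for it" hard to excuse. A sketch with placeholder file and column names; the four-fifths threshold is a common screening convention, not a legal safe harbor:

```python
import pandas as pd

decisions = pd.read_csv("decisions.csv")  # hypothetical: one row per applicant
approval = decisions.groupby("group")["approved"].mean()

# Adverse impact ratio: each group's approval rate vs. the highest group's
air = approval / approval.max()
print(air)
print("below four-fifths threshold:", list(air[air < 0.8].index))
```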
10. Is the model a black box to you, or can your team explain why any individual decision was made?
Why it matters: If the vendor can't explain a specific decision when a borrower disputes it, you'll be the one fielding the regulator's call.
Good answer: They can pull up any historical decision and walk through the top contributing features and their directional impact.
Red flag: "It's a deep learning model, explainability is approximate."
Production, monitoring, and integration
11. What does your production monitoring stack look like, and what alerts fire when?
Why it matters: Silent degradation is the killer. You need PSI on inputs, score distribution monitoring, and early-stage delinquency tracking — not just uptime.
Good answer: They monitor input drift, score drift, and approval rate by segment, with alert thresholds and a documented response process, and they'll share their monitoring dashboard (a minimal drift check is sketched below).
Red flag: Monitoring means "the API is up."
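You can also spot-check a vendor's score feed yourself by comparing a reference window against recent traffic. A sketch with illustrative dates and threshold; the two-sample KS test here is a stand-in for whatever the vendor's stack actually uses:

```python
import pandas as pd
from scipy.stats import ks_2samp

scores = pd.read_csv("score_log.csv", parse_dates=["scored_at"])
reference = scores.loc[scores["scored_at"] < "2024-01-01", "score"]
current = scores.loc[scores["scored_at"] >= "2024-06-01", "score"]

stat, p_value = ks_2samp(reference, current)
if stat > 0.1:  # illustrative threshold; calibrate to your volumes
    print(f"score drift alert: KS={stat:.3f}, p={p_value:.4f}")
```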
12. What's your latency at p95 and p99, not p50?
Why it matters: Your funnel conversion is sensitive to scoring latency. A model that's fast on average but spikes to 8 seconds on 1% of requests will hurt you.
Good answer: Specific numbers (e.g., p95 under 400ms, p99 under 800ms) with SLA commitments.
Red flag: Only p50 numbers, or vague "sub-second" claims.
13. Who owns the model output if we churn — and can we export historical scores and explanations?
Why it matters: If you switch vendors in 18 months, you need historical scoring data to validate the new vendor and to defend past decisions in audits.
Good answer: Clear data ownership terms, exportable scoring logs with reason codes, no lock-in on historical decisions.
Red flag: Scoring data lives only in their system and they charge for export.
14. How do you handle the cold-start problem on borrower segments we haven't lent to before?
Why it matters: Every fintech eventually expands segments. You need to know whether the vendor will recalibrate or just shrug.
Good answer: A documented process for champion-challenger testing, shadow-mode scoring on new segments, and willingness to incorporate your repayment data into retraining.
Red flag: "The model already covers everyone" — see question 1.
15. Can we run your model in shadow mode against our existing scorecard for 60-90 days before going live?
Why it matters: Shadow mode is the only safe way to validate a new model on your live traffic without risking the book. Any vendor confident in their product will agree.
Good answer: Yes, with a clear plan for comparing predictions and a defined go-live criterion (two basic comparisons are sketched below).
Red flag: Pressure to go live quickly, or pricing structures that punish shadow-mode periods.
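Before repayment data accumulates, two cheap comparisons tell you a lot: decision agreement at your cutoffs and rank correlation between the two scores. A sketch with placeholder columns and illustrative cutoffs:

```python
import pandas as pd
from scipy.stats import spearmanr

log = pd.read_csv("shadow_log.csv")  # hypothetical shadow-mode score log
INCUMBENT_CUTOFF, VENDOR_CUTOFF = 0.08, 0.08  # illustrative PD cutoffs

log["incumbent_approve"] = log["incumbent_score"] <= INCUMBENT_CUTOFF
log["vendor_approve"] = log["vendor_score"] <= VENDOR_CUTOFF

agreement = (log["incumbent_approve"] == log["vendor_approve"]).mean()
rho, _ = spearmanr(log["incumbent_score"], log["vendor_score"])
swap_set = log[log["incumbent_approve"] != log["vendor_approve"]]

print(f"decision agreement: {agreement:.1%}, rank correlation: {rho:.2f}")
print(f"swap set to review: {len(swap_set)} applications")
```

The swap set, the applicants the two models disagree on, is where your manual review effort should go during the shadow period.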
One last filter
If a vendor passes 12 of these 15, you're in reasonable shape. If they push back on questions 3, 4, and 15 — the ones that actually expose model quality on your data — assume the model isn't ready and the sales team knows it.
The cheapest mistake in fintech credit risk vendor selection is signing before you've seen the model run on your applicants. The most expensive is discovering the mismatch in production. We've seen this play out across lending builds, including Cashpo's KYC and credit scoring stack, where the gap between vendor demo and production behavior is usually wider than founders expect.
Frequently Asked Questions
What's the difference between AUROC and KS statistic, and which should I ask vendors for?
AUROC measures overall ranking ability across all thresholds; KS measures the maximum separation between good and bad borrowers at a specific cutoff. For lending, KS at your operating threshold is usually more actionable because it tells you how well the model discriminates where you actually approve or decline. Ask for both, plus the lift curve.
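Both metrics fall out of the same ROC computation, which is one reason to ask for both. A sketch, assuming y_true holds observed bads and y_score the model's default-risk scores from a holdout:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# y_true, y_score: outcome labels and model scores from your holdout
auroc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)

ks = np.max(tpr - fpr)  # KS = maximum good/bad separation across cutoffs
ks_cutoff = thresholds[np.argmax(tpr - fpr)]
print(f"AUROC={auroc:.3f}, KS={ks:.3f} at score cutoff {ks_cutoff:.3f}")
```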
Should we build our own credit scoring model or use a vendor?
Early on, vendors get you to market faster and let you focus on origination and collections. Once you have 12-18 months of repayment data on your specific borrower base, an in-house model — or a hybrid where you fine-tune on a vendor's base model — usually outperforms generic vendor scores. The decision hinges on data volume, regulatory posture, and engineering capacity.
How do I evaluate alternative data sources used by an AI credit scoring vendor?
Ask which alternative data sources contribute the most lift on your target segment, how the vendor handles missingness, and what the legal basis is for using each source in your jurisdiction. Some alternative data — telco, utility, transactional — can be predictive but legally restricted depending on geography. Don't assume a vendor's data stack is portable.
How long should a shadow-mode evaluation last before going live?
Long enough to observe early-stage delinquency on a meaningful sample of approved applicants — typically at least one full collection cycle past first payment due. Going live purely on score-distribution comparisons without seeing repayment behavior is a common and expensive shortcut. The right duration depends on your loan tenor and volume; for a personalized assessment of your evaluation plan, talk to CodeNicely's AI team.
What are the biggest red flags during an AI credit scoring vendor demo?
Three stand out: refusing to score a sample of your historical data, quoting only aggregate AUROC without segment or out-of-time breakdowns, and vague answers on adverse action reason codes. Any of these alone is a reason to slow down. All three together means the model probably isn't ready for your book.
Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.