
Your AI Feature Doesn't Need More Data. It Needs a Harder Objective.

For: A Series B SaaS product lead whose AI feature has been underperforming for two quarters despite the engineering team continuously adding training data and retraining — and who is starting to suspect the problem isn't the data volume at all

If your AI feature has been flat for two quarters and your engineering team keeps shipping more training data, stop. The problem almost certainly isn't data volume. It's that the objective your model is optimizing stopped mapping to the outcome your users care about — probably six months ago, probably without anyone noticing.

This is the most common failure mode I see in Series B SaaS teams with shipped AI features. The model is doing exactly what it was told. What it was told is no longer the right thing.

The thesis

Adding training data only helps when your loss function is a faithful proxy for the user outcome. When it isn't, more data just makes the model more confidently wrong about the wrong thing. You don't need a bigger dataset. You need a harder, more honest objective.

The diagnostic question isn't "is our model accurate?" It's "accurate at what, and does that thing still matter?"

Why this happens to good teams

On day one, someone — usually a founding engineer — picked an objective function. Click-through rate. Classification accuracy. RMSE on a labeled set. F1 on a binary judgment. That choice was reasonable at the time. It was tractable, measurable, and correlated with something the product manager cared about.

Then three things happened.

First, the product evolved. The feature that started as "surface relevant items" became "surface items that drive activation in the first session." The objective stayed the same.

Second, the user base shifted. Power users behave differently from new users, and the metric that captured value for the early cohort no longer captures it for the median user today.

Third — and this is the killer — the model got good enough at the proxy that it started exploiting it. This is Goodhart's Law in production. When a measure becomes a target, it ceases to be a good measure. A recommender optimized for CTR learns to surface clickbait. A support-ticket classifier optimized for accuracy learns to over-predict the majority class. A lead-scoring model optimized for conversion probability learns to score warm inbound leads highly and ignore the cold ones where the actual lift would have come from.

Each retraining cycle on more data makes this worse, not better. You are sharpening a knife pointed in the wrong direction.

Three concrete examples

1. The recommender that killed retention

A B2B SaaS team I talked to had a content recommendation feature inside their product. CTR on recommendations was their north star for the model. Over eighteen months, CTR climbed from 4% to 11%. Engineering was proud. Then someone finally looked at the cohort data: users who engaged with recommendations had worse 90-day retention than users who ignored them. The model had learned to surface novelty — short, surprising items that earned a click but added no durable value. The fix wasn't more data. It was reformulating the objective to predict 30-day return visits conditional on a click, which is a much harder learning problem with much sparser signal.
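To make that reformulation concrete, here is a minimal sketch of the label construction, assuming a hypothetical pandas event log with user_id, shown_at, and clicked columns plus a separate session log. The file names and columns are placeholders, and a real pipeline would be messier, but the key move is visible: the label is defined only on clicked impressions and looks 30 days forward.

```python
import pandas as pd

# Hypothetical logs -- file names and columns are placeholders.
# impressions: one row per recommendation shown (user_id, shown_at, clicked)
# sessions:    one row per product session (user_id, session_at)
impressions = pd.read_parquet("impressions.parquet")
sessions = pd.read_parquet("sessions.parquet")

def label_return_after_click(imp, sessions, horizon_days=30):
    """1 if the clicking user came back within horizon_days, else 0.
    Non-clicks get no label: the objective is conditional on a click."""
    if not imp.clicked:
        return None
    window_end = imp.shown_at + pd.Timedelta(days=horizon_days)
    later = sessions[
        (sessions.user_id == imp.user_id)
        & (sessions.session_at > imp.shown_at)
        & (sessions.session_at <= window_end)
    ]
    return int(len(later) > 0)

impressions["label"] = impressions.apply(
    label_return_after_click, axis=1, sessions=sessions
)
# The training set shrinks to clicked impressions only: sparser, harder signal.
train = impressions.dropna(subset=["label"])
```

Notice what conditioning on a click does to the dataset: most impressions fall away, which is exactly why the new objective is a harder learning problem than CTR.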

2. The support classifier that optimized away the hard tickets

A support automation feature was hitting 92% classification accuracy and the team kept feeding it labeled tickets to push it higher. The actual business outcome — deflection rate without CSAT damage — had been flat for a year. Why? The model had learned that the safest bet on ambiguous tickets was to route them to a generic queue. Accuracy stayed high because the generic queue was the modal class. CSAT on routed tickets was quietly tanking. The objective needed to change from accuracy to expected resolution quality, which required a completely different label schema and a much smaller, harder dataset.
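The mechanics are easy to see with a toy illustration (fabricated numbers, not this team's data): when the modal class dominates, headline accuracy stays high even as recall collapses on the classes where routing actually matters.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy illustration of the failure mode, not real ticket data.
# 90 of 100 tickets belong to the generic queue, so a model that routes
# almost everything there still reports ~91% accuracy.
y_true = ["generic"] * 90 + ["billing"] * 5 + ["security"] * 5
y_pred = ["generic"] * 90 + ["billing"] * 1 + ["generic"] * 4 + ["generic"] * 5

print(accuracy_score(y_true, y_pred))  # 0.91 -- looks healthy
print(classification_report(y_true, y_pred, zero_division=0))
# Recall on billing is 0.20 and on security is 0.00: the tickets where
# routing actually matters are exactly the ones being optimized away.
```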

3. The credit model that scored the easy cases

In lending, this pattern is brutal. A credit scoring model trained to maximize AUC on historical default data will become excellent at confirming what underwriters already know. The marginal lift — the cases at the decision boundary where automation actually creates value — is exactly where AUC is least informative. We've seen this in fintech work on credit scoring systems for lending products: the right objective isn't "predict default" but "predict default among applicants the existing rule engine would have rejected." That reformulation changes the entire training pipeline, and no amount of additional data on the easy cases would have surfaced the problem.
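A rough sketch of what that scoping change looks like, assuming a hypothetical applications table and a stand-in for the existing rule engine; the column names and thresholds below are placeholders, not a real underwriting policy.

```python
import pandas as pd

# Hypothetical applications table; column names are placeholders.
apps = pd.read_parquet("applications.parquet")

def rule_engine_would_reject(row) -> bool:
    """Stand-in for the existing underwriting rules -- in practice this is
    whatever your current rule engine already encodes."""
    return row["debt_to_income"] > 0.45 or row["months_on_book"] < 6

# Old objective: predict default across all applicants, dominated by the
# easy cases the rule engine already handles correctly.
# Reformulated objective: predict default only where the rules would have
# said no, i.e. where automated judgment adds marginal lift.
marginal = apps[apps.apply(rule_engine_would_reject, axis=1)]
X = marginal.drop(columns=["defaulted"])
y = marginal["defaulted"]
# The corpus gets much smaller, and many outcomes are unobserved (rejected
# applicants were never funded), so reject inference or a controlled
# approval experiment is usually needed just to get labels.
```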

How to tell if you have an objective problem

Some signals that the reason your ML model isn't improving with more data is the objective, not the dataset:

  - The proxy metric (CTR, accuracy, AUC) keeps climbing while the business metric it was supposed to drive stays flat or declines.
  - Retraining on more data produces diminishing lift on the ML metric and none on the user outcome.
  - The model's wins are concentrated on easy or majority-class cases; the hard, high-value cases haven't moved.
  - The objective function hasn't changed since launch, even though the product and the user base have.
  - Nobody on the team can state, in one sentence, how the loss function maps to something a user would recognize as value.

If three or more of these are true, more data won't save you.

The strongest counter-argument

Here's the honest pushback: sometimes it really is a data problem. Long-tail behaviors, rare events, distribution shift after a product launch — these are genuine cases where more or fresher data is the right fix. And reformulating an objective is expensive. It usually means new labels, new evaluation harnesses, and renegotiating what "good" means with the people who fund the team.

The way to tell the difference: run a ceiling analysis. Take your current model and replace its predictions with ground truth on a sample. If perfect predictions on your current objective don't move the business metric you actually care about, you have an objective problem and no amount of data will fix it. If perfect predictions do move the business metric and your model is far from that ceiling, then yes, more or better data is the right investment.
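Here is a sketch of the harness, with the function signatures as assumptions rather than a prescribed API; the only hard requirement is that the metric you evaluate is the business outcome, not the ML metric.

```python
def ceiling_analysis(holdout, business_metric, model_predict):
    """Compare the business metric under live model predictions against an
    oracle handed the ground-truth labels for the current objective.

    Assumptions: holdout carries both the ML labels (column "label") and
    whatever fields the business metric needs; business_metric maps a set
    of decisions to the number you actually report (retention lift,
    deflection without CSAT damage, loss rate), not an ML metric.
    """
    live = business_metric(holdout, decisions=model_predict(holdout))
    oracle = business_metric(holdout, decisions=holdout["label"])
    return {"live": live, "oracle_ceiling": oracle, "headroom": oracle - live}

# If oracle_ceiling barely beats live, perfect predictions on the current
# objective wouldn't move the business metric: objective problem, not data.
# If the gap is large, more or better data is a defensible investment.
```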

Most teams skip this analysis because it's uncomfortable. It forces you to admit that the metric you've been reporting in board decks for a year might not matter.

What to do differently on Monday

If you're the product lead on a stagnating AI feature, here's the sequence:

  1. Write down the user outcome in plain English. Not the metric. The thing a user would say if you asked them whether the feature helped. "I found a customer faster." "I closed the books with fewer corrections." "I avoided a bad loan."
  2. Write down the current loss function. Be specific. Cross-entropy on what labels? RMSE against what target?
  3. Map one to the other. If a user got the outcome in step one but the model would score that case poorly, your objective is broken. If the model scores a case highly but no user outcome was achieved, your objective is broken. A quick cross-tab makes this concrete; see the sketch after this list.
  4. Run the ceiling analysis. Replace predictions with ground truth on a holdout. Measure the business metric, not the ML metric.
  5. Reformulate before you retrain. A new objective usually needs new labels, a new evaluation set, and a smaller, harder training corpus. This feels like going backwards. It isn't.
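For step 3, a simple way to make the mapping visible is a two-by-two cross-tab of hand-labelled user outcomes against model scores. The values below are toy placeholders standing in for a manual review of recent cases.

```python
import pandas as pd

# Step 3 as a table: sample recent cases, hand-label whether the user got the
# plain-English outcome from step 1, and record whether the model scored the
# case highly (e.g. above the serving threshold). Toy values below.
sample = pd.DataFrame({
    "outcome_achieved": [True, True, False, False, True, False],
    "model_score_high": [True, False, True, True, False, False],
})
print(pd.crosstab(sample["outcome_achieved"], sample["model_score_high"]))
# Mass in the off-diagonal cells -- outcomes the model scored poorly, or high
# scores with no user outcome -- is direct evidence the objective is broken.
```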

The teams that get unstuck are the ones willing to throw away a year of metric history and start measuring something harder. The teams that stay stuck keep adding data to a pipeline that was pointed in the wrong direction in 2023.

If you're rebuilding the objective layer of a production AI feature and want a second pair of eyes on the reformulation, that's the kind of work our AI engineering team does with scaleup product teams. But honestly — most of the diagnostic work above you can do internally in a week. The hard part isn't the analysis. It's being willing to act on what it tells you.

Frequently Asked Questions

How do I know if my ML model is suffering from objective mismatch versus a real data problem?

Run a ceiling analysis: replace your model's predictions with ground-truth labels on a holdout set and measure your actual business metric. If perfect predictions don't move the business outcome, your objective function is misaligned with the outcome and more data won't help. If they do move it and your live model is far from that ceiling, then better or more data is a legitimate fix.

What is Goodhart's Law and why does it matter for AI features in SaaS?

Goodhart's Law says that when a measure becomes a target, it stops being a good measure. In production ML, this shows up when your model gets so good at optimizing a proxy metric that it starts exploiting weaknesses in that proxy — surfacing clickbait, over-predicting majority classes, or ignoring the high-value edge cases where the feature was supposed to add lift. It's the single most common reason mature AI features stagnate.

Should we keep collecting training data while we reformulate the objective?

Usually no, or at least not on the same labeling schema. A new objective typically requires new labels with different annotation guidelines, a new evaluation set, and sometimes a much smaller training corpus focused on the cases that actually matter. Continuing to collect data on the old schema can entrench the team's commitment to the wrong target.

How often should a product team revisit the objective function of a shipped AI feature?

At minimum every time the product changes meaningfully, the user base shifts, or the business metric the feature was meant to drive plateaus for two consecutive quarters. In practice, most Series B teams under-revisit this — the original objective tends to ossify because it's wired into dashboards, retraining pipelines, and OKRs.

How long does it take to reformulate an AI feature's objective and what does it cost?

It depends entirely on the feature, the labeling complexity, and how much of the existing pipeline can be reused. For a personalized assessment of your specific situation, contact CodeNicely — we can scope the diagnostic and reformulation work against your current architecture.

Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.