Feature Stores Explained: Why Your ML Models Go Stale in Production
For: A Series B fintech CTO whose credit risk or fraud ML model performs well in training but drifts unexpectedly in production — and suspects the problem is somewhere in how features are computed, but has never heard of a feature store and doesn't know if they need one
Your credit risk model scored 0.87 AUC in backtesting. Three months into production, it's drifting toward 0.78 and your data scientists keep blaming “distribution shift.” Maybe. But before you retrain on fresh data for the third time this quarter, check something else: are the features your model sees at inference time computed by exactly the same code that produced the training set? In most fintech ML stacks the answer is no — and that gap, not the data, is why your model is decaying.
This is called training-serving skew, and it's the problem feature stores were built to solve.
The problem: two pipelines computing “the same” feature
A typical credit scoring model uses features like `avg_transaction_amount_30d`, `num_failed_logins_7d`, or `days_since_last_credit_inquiry`. During training, a data scientist writes a SQL query or pandas script against your warehouse to compute these for every historical user. The model trains, validates, ships.
Now a loan application comes in. Your API needs that same feature vector in under 200ms. Nobody is going to run a 40-second Snowflake query per request, so a backend engineer reimplements the feature logic in Python or Go against your Postgres replica or Redis cache. They get it 95% right.
That 5% delta is your skew. Examples we've seen in production:
- Training pipeline used UTC timestamps. Serving pipeline used local time. `num_transactions_today` was off by up to 24 hours for one in three users.
- Training computed `avg_transaction_amount_30d` as a strict 30-day rolling window. Serving used "last 30 transactions." For dormant users these are wildly different numbers.
- Training joined against a customers table that filtered out closed accounts. Serving didn't. Closed-account features leaked into live scoring.
- Training treated NULL as 0. Serving treated NULL as -1 because the engineer thought a sentinel value was cleaner.
None of these will throw an error. The model just quietly gets worse. Your team will spend weeks investigating “concept drift” when the actual problem is that the model never saw the production feature distribution because production is computing different numbers.
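To see how quietly this happens, take the rolling-window mismatch from the second bullet. A minimal pandas sketch with hypothetical data for one dormant user:

```python
import pandas as pd

# Hypothetical history for a dormant user: 30 small purchases a year
# ago, then one large purchase this week.
txns = pd.DataFrame({
    "amount": [20.0] * 30 + [900.0],
    "ts": pd.to_datetime(["2024-03-01"] * 30 + ["2025-02-20"]),
})
now = pd.Timestamp("2025-02-21")

# Training definition: strict 30-day window.
in_window = txns[txns["ts"] >= now - pd.Timedelta(days=30)]
train_value = in_window["amount"].mean()      # 900.0

# Serving reimplementation: "last 30 transactions".
serve_value = txns.tail(30)["amount"].mean()  # ~49.3

# Same feature name, wildly different numbers, and no error anywhere.
```

Both implementations are individually defensible. That is exactly why code review doesn't catch this class of bug.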
The analogy: a recipe vs. two cooks
Think of a feature as a dish. Your data scientist wrote the recipe (training pipeline). Your backend engineer wrote a different recipe from memory (serving pipeline). Both call the result “tomato soup,” but one is using fresh tomatoes and the other is using paste. The model was trained to recognize the first one.
A feature store is the single recipe book. Both cooks read from it. The dish tastes the same in the test kitchen and the dining room.
What a feature store actually is
Strip the marketing away and a feature store is three things glued together:
- A feature definition layer. You declare a feature once — its source, its transformation, its freshness requirements — usually in code (Python, SQL, or YAML). Tools: Feast, Tecton, Hopsworks, Databricks Feature Store, Vertex AI Feature Store.
- An offline store for training. Typically a warehouse or lake (BigQuery, Snowflake, S3 + Parquet). Holds the full history. Supports point-in-time correct joins so you don't leak future data into training labels.
- An online store for serving. A low-latency KV store (Redis, DynamoDB, Bigtable). Holds the latest feature value for each entity, refreshed by the same pipeline that populates the offline store.
The critical property is that both stores are written by the same transformation code. You don't have a training pipeline and a separate serving pipeline. You have one definition, materialized two ways.
A minimal worked example
Say you're building a fraud model and you need `num_distinct_merchants_7d` per user. In Feast, the definition looks roughly like this (simplified; `user` and `transactions_source` would be declared elsewhere in your feature repo):
```python
from datetime import timedelta

@feature_view(
    entities=[user],
    ttl=timedelta(days=7),
    source=transactions_source,
)
def user_merchant_diversity(df):
    # One row per user: count of distinct merchants inside the 7-day window.
    return (
        df.groupby("user_id")
        .agg(num_distinct_merchants_7d=("merchant_id", "nunique"))
    )
```

At training time you call `store.get_historical_features(entity_df, ["user_merchant_diversity:num_distinct_merchants_7d"])`. It runs the transformation against your warehouse, point-in-time joined to your label timestamps. No leakage.
At serving time, an Airflow or streaming job runs the same transformation and pushes the latest values into Redis. Your API calls `store.get_online_features(...)` and gets the value in under 10ms.
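Side by side, the two retrieval paths look like this (a sketch against Feast's client API, reusing the feature from the example above; entity IDs and timestamps are illustrative):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")
FEATURES = ["user_merchant_diversity:num_distinct_merchants_7d"]

# Training: point-in-time join against the offline store. The
# event_timestamp column drives the join, so no future data leaks in.
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2025-01-15", "2025-01-20"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df, features=FEATURES
).to_df()

# Serving: latest materialized value from the online store.
online = store.get_online_features(
    features=FEATURES, entity_rows=[{"user_id": 1001}]
).to_dict()
```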
Same code. Same logic. Different storage targets. That's the whole trick.
Real-time feature store vs batch: what your use case needs
This is where most teams overbuild. The real-time feature store vs batch question comes down to how fresh a feature needs to be at the moment of prediction.
- Batch features are computed on a schedule (hourly, daily). `avg_credit_utilization_90d` doesn't need to reflect a transaction from 30 seconds ago. Cheap, simple, covers 70-80% of fintech use cases.
- Streaming features are updated as events arrive via Kafka/Kinesis. `num_transactions_last_5min` for fraud detection needs this. Adds real engineering cost — exactly-once semantics, watermarks, late events.
- On-demand features are computed at request time from the request payload (e.g., `distance_between_billing_and_shipping_zip`). The feature store just guarantees the function definition is shared between training and serving; a minimal sketch follows after the next paragraph.
Most credit scoring models do fine with batch. Most card-fraud models need streaming for at least a handful of velocity features. Don't pay for streaming complexity if your decision window is “within 24 hours.”
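For the on-demand case, Feast's `@on_demand_feature_view` registers a request-time transformation so training and serving share one definition. A sketch, assuming the payload carries coordinates rather than raw zip codes (the source and field names here are illustrative):

```python
import numpy as np
import pandas as pd
from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64

# Request-time inputs that exist only in the payload, never in the warehouse.
checkout_request = RequestSource(
    name="checkout_request",
    schema=[
        Field(name="billing_lat", dtype=Float64),
        Field(name="billing_lon", dtype=Float64),
        Field(name="shipping_lat", dtype=Float64),
        Field(name="shipping_lon", dtype=Float64),
    ],
)

@on_demand_feature_view(
    sources=[checkout_request],
    schema=[Field(name="billing_shipping_km", dtype=Float64)],
)
def billing_shipping_distance(inputs: pd.DataFrame) -> pd.DataFrame:
    # Haversine distance between billing and shipping coordinates.
    lat1, lon1 = np.radians(inputs["billing_lat"]), np.radians(inputs["billing_lon"])
    lat2, lon2 = np.radians(inputs["shipping_lat"]), np.radians(inputs["shipping_lon"])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    out = pd.DataFrame()
    out["billing_shipping_km"] = 6371.0 * 2 * np.arcsin(np.sqrt(a))
    return out
```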
Gotchas nobody mentions in the vendor demo
- Point-in-time correctness is harder than it sounds. If your label is “did this loan default within 90 days,” your training features must reflect what was knowable at application time, not now. Feature stores help, but you still need to be careful about which timestamp column drives the join.
- Online/offline parity is a discipline, not a checkbox. If a streaming job lags 4 hours behind the batch job, your online features will silently disagree with offline. Monitor freshness and value-distribution drift between the two stores explicitly; a minimal check is sketched after this list.
- The feature store doesn't fix bad features. If your `credit_utilization` feature was always slightly wrong, the store will faithfully serve the wrong feature in both environments. Garbage in, consistent garbage out.
- Operational cost is real. You now run a Redis cluster, an offline store, and a materialization pipeline. Small teams shipping one model often don't need this — a well-tested shared Python library that both training and serving import can get you most of the way.
- Vendor lock-in. Tecton and Hopsworks are great but proprietary. Feast is open source but you're operating it. Pick based on team size, not feature checklist.
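A minimal version of the parity check from the second bullet, assuming you can pull the latest values from each store into pandas frames keyed by `user_id` with a value column and a `ts` timestamp (the loading plumbing is yours):

```python
import pandas as pd

def parity_report(offline: pd.DataFrame, online: pd.DataFrame,
                  feature: str, tol: float = 1e-6) -> dict:
    # Both frames: one row per user_id, columns [feature, "ts"].
    merged = offline.merge(online, on="user_id", suffixes=("_off", "_on"))
    value_gap = (merged[f"{feature}_off"] - merged[f"{feature}_on"]).abs()
    staleness = (merged["ts_off"] - merged["ts_on"]).abs()
    return {
        "mismatch_rate": float((value_gap > tol).mean()),
        "max_staleness_hours": staleness.max().total_seconds() / 3600,
    }

# Alert when either number creeps up; silent disagreement is the failure mode.
```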
When you actually need a feature store
You probably need one if:
- You have more than 2-3 ML models in production sharing overlapping features.
- You've already had at least one incident traced back to training-serving skew.
- Your features mix batch (warehouse) and real-time (stream) sources.
- You're in a regulated context (lending, fraud) where reproducibility matters for audits.
You probably don't need one if:
- You have one model and a small team. A shared transformation library plus disciplined integration tests is cheaper.
- All your features are computed from the request payload at inference time. There's nothing to materialize.
- You're still figuring out whether the model itself is worth shipping. Build the feature store after product-market fit, not before.
For fintech teams shipping credit or fraud models, the typical pattern we see at companies like Cashpo is: start without a feature store, get burned by skew once, then introduce Feast or a managed equivalent for the second model. That order is usually right. The first model teaches you what your features actually are. The second one is when consistency starts paying for itself.
The diagnostic question
Before you blame data drift the next time your model misbehaves, run one experiment: pull a sample of users scored in production yesterday. Recompute their features from scratch using your training pipeline. Compare value-by-value.
If more than 1-2% of feature values disagree, you don't have a model problem. You have a feature store problem — whether you call it that or not.
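In pandas, the experiment is a dozen lines, assuming you log the feature vector used for each production score (`recompute_training_features` stands in for your offline pipeline; all paths and column names are illustrative):

```python
import numpy as np
import pandas as pd

# Features the model actually saw, logged at scoring time.
served = pd.read_parquet("scored_features_yesterday.parquet")

# The same users, recomputed from scratch through the training pipeline.
recomputed = recompute_training_features(
    user_ids=served["user_id"], as_of=served["scored_at"]
)

merged = served.merge(recomputed, on="user_id", suffixes=("_serve", "_train"))
for feat in ["avg_transaction_amount_30d", "num_failed_logins_7d"]:
    disagree = ~np.isclose(
        merged[f"{feat}_serve"], merged[f"{feat}_train"], equal_nan=True
    )
    print(f"{feat}: {disagree.mean():.1%} of values disagree")
```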
Frequently Asked Questions
Do I need a feature store if I only have one ML model in production?
Probably not yet. A shared Python module imported by both your training notebook and your serving API, plus integration tests that assert feature parity on a held-out sample, will catch most skew at a fraction of the operational overhead. Reach for a feature store when you have multiple models, multiple teams touching features, or streaming sources mixed with batch.
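Such a parity test can be a few lines of pytest (a sketch; `avg_transaction_amount_30d` is your shared transformation and the fixture holds values captured from the serving path):

```python
import pandas as pd
from features import avg_transaction_amount_30d  # the shared module, hypothetical

def test_feature_parity_on_holdout():
    sample = pd.read_parquet("tests/fixtures/parity_holdout.parquet")
    served = sample["avg_transaction_amount_30d_served"]  # captured from the API
    trained = avg_transaction_amount_30d(sample)          # training-side code path
    pd.testing.assert_series_equal(
        trained, served, check_names=False, rtol=1e-6
    )
```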
What's the difference between a feature store and a data warehouse?
A warehouse stores raw and modeled data optimized for analytics queries. A feature store sits on top, defines reusable feature transformations, and adds an online store for sub-100ms serving plus point-in-time correct historical joins for training. You'll typically run both — the feature store reads from the warehouse for batch features.
Can I use a feature store for credit scoring with only batch features?
Yes, and this is the most common fintech setup. Credit decisions usually tolerate features that are hours or even a day stale. You get the consistency benefits of the store without taking on the streaming infrastructure. Add real-time only when a specific feature genuinely needs sub-minute freshness — typically velocity checks for fraud, not credit risk.
Feast vs Tecton vs building in-house — how do I decide?
Feast is open source, lightweight, and good if you have platform engineers who can operate Redis and orchestration. Tecton and Hopsworks are managed, opinionated, and faster to adopt if you'd rather pay than operate. Building in-house only makes sense at significant scale where existing tools don't fit your data model. For most Series B fintechs, Feast or a managed service is the right call.
How long does it take to roll out a feature store for an existing ML system?
It depends heavily on how many features you have, how clean your warehouse is, and whether you need streaming. Migration is rarely a big-bang project — most teams onboard one model at a time. For a scoped assessment of your specific setup, talk to CodeNicely for a personalized review.
Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.