June 25, 2026 • 11 min read

Detect Data Drift in a Scikit-learn Model Before Users Do

If your scikit-learn classifier is in production and you only learn it's degrading when a customer complains, you need a per-feature Population Stability Index (PSI) check running on a schedule — not a KL-divergence dashboard on the label distribution. PSI gives you one interpretable number per input feature, works with reference windows as small as a few thousand rows, and flags the silent input shifts that move business metrics weeks later. This tutorial walks through wiring PSI into an existing scikit-learn pipeline in under an hour, with no MLOps platform required.

Why PSI instead of KL divergence on predictions

Most drift tutorials default to KL divergence or chi-square on the predicted label distribution. That's a lagging signal. By the time your prediction mix has visibly shifted, the input distribution shifted weeks ago and your precision is already underwater on the segments that matter.

PSI, borrowed from credit scoring where it has been the standard for decades, measures the shift in a single feature's distribution between a reference period and a current period. The interpretation is fixed across teams and features:

PSI < 0.1 — no meaningful shift
0.1 ≤ PSI < 0.25 — moderate shift, investigate
PSI ≥ 0.25 — significant shift, retrain candidate
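
For reference, the number behind these bands is a sum over bins: PSI = Σ (c_i - r_i) · ln(c_i / r_i), where r_i and c_i are the shares of reference and current rows falling in bin i. A minimal worked example follows; the three-bin shares are made up purely for illustration.

```python
import math

# Hypothetical bin shares for one feature (each list sums to 1.0)
ref_share = [0.5, 0.3, 0.2]   # reference period
cur_share = [0.4, 0.3, 0.3]   # current period

# PSI = sum over bins of (current - reference) * ln(current / reference)
psi_value = sum((c - r) * math.log(c / r) for r, c in zip(ref_share, cur_share))

print(round(psi_value, 4))  # about 0.0629, inside the "no meaningful shift" band
```

Ten points of mass moving from the first bin to the last is visible to the eye, yet it still scores under 0.1. That is the sense in which the bands are conservative.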

The tradeoffs are real. PSI assumes you can bin a feature sensibly, it's noisy on very low-cardinality categoricals, and it tells you that something shifted, not why. For high-cardinality text or embeddings you'll want a different tool (MMD or a domain classifier). For the tabular features that drive most B2B SaaS classifiers — plan tier, usage counts, account age, geography — PSI is the right first instrument.

Prerequisites

Python 3.9+
An existing scikit-learn classifier serialized with joblib
Access to the training feature matrix (or a representative sample ≥ ~5,000 rows) as your reference
A way to dump recent production inputs to a Parquet or CSV file on a schedule (cron, Airflow, dbt, whatever you already have)
pip install pandas numpy scikit-learn pyarrow

No new infrastructure. The output of this tutorial is a Python script you can run from cron and a JSON report you can pipe into Slack or email.

Step 1: Snapshot the reference distribution

The reference is the world your model was trained for. Save it once, version it, and don't change it until you retrain.

import pandas as pd
import joblib

# Load your training features (the X you fit on)
X_train = pd.read_parquet("data/train_features.parquet")

# Save as the immutable reference snapshot
X_train.to_parquet("drift/reference_v1.parquet", index=False)

print(X_train.shape)
print(X_train.dtypes)

Expected output:

(48211, 14)
plan_tier            object
seats                int64
monthly_active_users int64
account_age_days     int64
...

Pin the filename with a version suffix. When you retrain, you'll bump to reference_v2.parquet and reset your drift baselines.
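
One way to make "immutable" checkable is to record a checksum next to the snapshot and verify it before each drift run. The sketch below is an assumption layered on top of this tutorial, not part of it: the helper names and the `.manifest.json` suffix are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def write_manifest(snapshot_path: str) -> dict:
    """Record a SHA-256 checksum next to the snapshot (hypothetical helper)."""
    path = Path(snapshot_path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    manifest = {"file": path.name, "sha256": digest}
    Path(str(path) + ".manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

def verify_manifest(snapshot_path: str) -> bool:
    """Return True if the snapshot still matches its recorded checksum."""
    path = Path(snapshot_path)
    manifest = json.loads(Path(str(path) + ".manifest.json").read_text())
    return hashlib.sha256(path.read_bytes()).hexdigest() == manifest["sha256"]
```

Call `write_manifest("drift/reference_v1.parquet")` once after saving the snapshot, and have the scheduled script stop if `verify_manifest` returns False.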

Step 2: Write the PSI function

This is the whole calculation. No library needed.

import numpy as np
import pandas as pd

def psi(reference: pd.Series, current: pd.Series, bins: int = 10) -> float:
    """Population Stability Index for a single feature."""
    eps = 1e-6

    if pd.api.types.is_numeric_dtype(reference):
        # Quantile bins from reference; apply same edges to current
        edges = np.unique(np.quantile(reference.dropna(), np.linspace(0, 1, bins + 1)))
        if len(edges) < 3:
            return 0.0  # not enough variation to score
        # Open the outer bins: np.histogram ignores values outside the edges,
        # so current values beyond the reference min/max would otherwise vanish
        edges[0], edges[-1] = -np.inf, np.inf
        ref_counts, _ = np.histogram(reference.dropna(), bins=edges)
        cur_counts, _ = np.histogram(current.dropna(), bins=edges)
    else:
        # Categorical: align on the union of categories
        categories = pd.Index(reference.dropna().unique()).union(current.dropna().unique())
        ref_counts = reference.value_counts().reindex(categories, fill_value=0).values
        cur_counts = current.value_counts().reindex(categories, fill_value=0).values

    ref_pct = ref_counts / max(ref_counts.sum(), 1)
    cur_pct = cur_counts / max(cur_counts.sum(), 1)

    ref_pct = np.where(ref_pct == 0, eps, ref_pct)
    cur_pct = np.where(cur_pct == 0, eps, cur_pct)

    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

Two design choices worth flagging. First, bin edges come from the reference, not the current window. Quantile edges computed from the current window would put roughly equal mass in every bin by construction and hide the shift. Second, the epsilon prevents log-of-zero blowups when a category disappears in the current window, which is itself a strong drift signal.
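
To see why the source of the bin edges matters, here is a small sketch on synthetic data (the normal distributions and the +1 shift are assumptions chosen for illustration). It bins the same shifted window twice: once with its own decile edges and once with the reference's.

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)  # synthetic reference
current = rng.normal(loc=1.0, scale=1.0, size=10_000)    # synthetic, shifted by +1

def decile_edges(values: np.ndarray) -> np.ndarray:
    """Decile edges with open-ended outer bins."""
    edges = np.quantile(values, np.linspace(0, 1, 11))
    edges[0], edges[-1] = -np.inf, np.inf
    return edges

# Current window binned with its own edges: about 10% per bin by construction
own_share = np.histogram(current, bins=decile_edges(current))[0] / current.size
# Current window binned with reference edges: the shift shows up as skew
ref_share = np.histogram(current, bins=decile_edges(reference))[0] / current.size

print(own_share.round(2))
print(ref_share.round(2))
```

The first array is flat no matter how far the window has moved. The second piles mass into the upper bins, which is exactly the signal PSI needs.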

Step 3: Score a recent production window

Pull the last seven days of inputs your model actually scored. Don't use labels — you usually don't have them yet, and drift detection should not wait for ground truth.

reference = pd.read_parquet("drift/reference_v1.parquet")
current = pd.read_parquet("data/scored_last_7d.parquet")

results = []
for col in reference.columns:
    if col not in current.columns:
        continue
    score = psi(reference[col], current[col])
    results.append({"feature": col, "psi": round(score, 4)})

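# Reset the index so rows are numbered in sorted order, as in the output below
report = pd.DataFrame(results).sort_values("psi", ascending=False).reset_index(drop=True)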
print(report)

Expected output on a healthy week:

                feature     psi
0      account_age_days  0.0412
1                 seats  0.0387
2  monthly_active_users  0.0301
3             plan_tier  0.0188
...

Everything under 0.1 — no action. A bad week looks more like this:

                feature     psi
0             plan_tier  0.3142
1  monthly_active_users  0.2871
2      account_age_days  0.0455
...

Two features crossed 0.25. Time to investigate before precision tanks.
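
The scoring loop skips columns that are missing from the current window, so a renamed or dropped column never shows up in the report. A small pre-check can surface that. This is a sketch under assumptions: `schema_diff` is a hypothetical helper and the toy frames stand in for the real snapshot and scored window.

```python
import pandas as pd

def schema_diff(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
    """List columns that cannot be compared as-is (hypothetical helper)."""
    ref_cols, cur_cols = set(reference.columns), set(current.columns)
    shared = sorted(ref_cols & cur_cols)
    return {
        "missing_in_current": sorted(ref_cols - cur_cols),
        "extra_in_current": sorted(cur_cols - ref_cols),
        "dtype_mismatch": [
            c for c in shared if reference[c].dtype != current[c].dtype
        ],
    }

# Toy frames standing in for the reference snapshot and the scored window
ref = pd.DataFrame({"seats": [5, 10], "plan_tier": ["pro", "free"]})
cur = pd.DataFrame({"seats": [5.0, 12.0], "region": ["eu", "us"]})

diff = schema_diff(ref, cur)
print(diff)
```

A non-empty `missing_in_current` or `dtype_mismatch` usually points at an upstream pipeline change rather than real drift, so it is worth logging alongside the PSI report.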

Step 4: Add severity tagging and a JSON report

Cron jobs that print DataFrames are useless at 2am. Emit a structured report.

import json
from datetime import datetime, timezone

def tag(psi_value: float) -> str:
    """Map a PSI value to the severity bands used in this tutorial."""
    if psi_value < 0.1:
        return "stable"
    if psi_value < 0.25:
        return "moderate"
    return "significant"

report["severity"] = report["psi"].apply(tag)

# datetime.utcnow() is deprecated as of Python 3.12; use a timezone-aware UTC timestamp
now = datetime.now(timezone.utc)

payload = {
    "model": "churn_classifier",
    "model_version": "2024.07.1",
    "reference_version": "v1",
    "window": "last_7d",
    "generated_at": now.isoformat(),
    "features": report.to_dict(orient="records"),
    "alert": bool((report["psi"] >= 0.25).any()),
}

with open(f"drift/report_{now:%Y%m%d}.json", "w") as f:
    json.dump(payload, f, indent=2)

print("alert:", payload["alert"])

The alert boolean is what your cron wrapper checks. If true, post to Slack or page the on-call. If false, the report still lands in object storage for trend analysis.
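
A cron wrapper along those lines might look like the sketch below. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, and the idea of passing the webhook URL in from the caller, are illustrative choices rather than part of the tutorial's script.

```python
import json
import urllib.request

def build_slack_message(payload: dict) -> dict:
    """Turn a drift report into a Slack incoming-webhook body."""
    flagged = [f for f in payload["features"] if f["psi"] >= 0.25]
    lines = [f"{f['feature']}: PSI {f['psi']}" for f in flagged]
    header = f"Drift alert for {payload['model']} ({payload['window']})"
    return {"text": "\n".join([header, *lines])}

def post_to_slack(webhook_url: str, message: dict) -> None:
    """POST the message as JSON; webhook_url is supplied by the caller."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

def main(report_path: str, webhook_url: str) -> int:
    """Read one report and notify on alert; returns a process exit code."""
    with open(report_path) as f:
        payload = json.load(f)
    if payload["alert"]:
        post_to_slack(webhook_url, build_slack_message(payload))
        return 1  # non-zero exit so the scheduler can also register the alert
    return 0
```

Keeping message construction separate from the network call makes the formatting easy to test without touching Slack.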

Step 5: Sanity-check against your model's actual feature importance

A 0.3 PSI on a feature your model barely uses is a footnote. A 0.12 PSI on the top-importance feature is an emergency. Weight the report.

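import joblib

model = joblib.load("models/churn_classifier.joblib")
clf = model.named_steps["clf"]

# feature_importances_ exists on tree-based models; use permutation_importance for others.
# feature_names_in_ is only set when the estimator was fit on a DataFrame. If an earlier
# pipeline step encodes or expands columns, these are the encoded names, so map them
# back to the raw feature names before merging.
importances = pd.Series(
    clf.feature_importances_,
    index=clf.feature_names_in_,
).rename("importance")

weighted = report.merge(importances, left_on="feature", right_index=True, how="left")
# Features with no matching importance get an impact of 0; check for NaN importances
# so a name mismatch does not silently hide a drifted feature.
weighted["impact_score"] = weighted["psi"] * weighted["importance"].fillna(0)
weighted = weighted.sort_values("impact_score", ascending=False)
print(weighted.head(5))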

Now your alerts prioritize features that actually move predictions. This single step is what separates a noisy drift monitor from one engineers stop ignoring after two weeks.
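
For models without `feature_importances_`, scikit-learn's `permutation_importance` can fill the same role. The sketch below uses synthetic data and a logistic regression as stand-ins (both are assumptions). In practice you would pass your own fitted model and, ideally, a held-out feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: only "signal" drives the label
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y = (X["signal"] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Shuffle each column in turn and measure the drop in the model's score
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns).rename("importance")
print(importances.sort_values(ascending=False))
```

The resulting Series has the same shape as the tree-based one, so it can be merged into the report the same way.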

Step 6: Schedule it

Wrap the script and add it to whatever scheduler you already run. A minimal crontab entry:

0 8 * * 1 cd /opt/ml-monitoring && /usr/bin/python3 drift_check.py >> logs/drift.log 2>&1

Weekly is the right cadence to start. Daily creates alert fatigue on noisy features; monthly is too slow to beat user complaints. Once you have four to six weeks of reports, you'll know which features are intrinsically noisy and can either widen their thresholds or drop them from alerting.
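
Once a few weekly reports exist, a short script can show which features are noisy. This is a sketch that assumes the JSON layout from Step 4 (`generated_at` plus a `features` list of `feature`/`psi` records). The helper names are hypothetical.

```python
import glob
import json
import pandas as pd

def load_history(report_glob: str) -> pd.DataFrame:
    """Stack weekly JSON reports into one row per (report, feature)."""
    rows = []
    for path in sorted(glob.glob(report_glob)):
        with open(path) as f:
            payload = json.load(f)
        for item in payload["features"]:
            rows.append({
                "generated_at": payload["generated_at"],
                "feature": item["feature"],
                "psi": item["psi"],
            })
    return pd.DataFrame(rows, columns=["generated_at", "feature", "psi"])

def noise_summary(history: pd.DataFrame) -> pd.DataFrame:
    """Per-feature mean, spread, and max of PSI across reports."""
    return (
        history.groupby("feature")["psi"]
        .agg(["mean", "std", "max"])
        .sort_values("std", ascending=False)
    )
```

Calling `noise_summary(load_history("drift/report_*.json"))` puts the most volatile features at the top, which is a reasonable starting list for widened thresholds.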

Step 7: Decide what to do on a real alert

Drift detection without a response playbook is theater. The minimum useful playbook:

Confirm the shift is real, not an upstream pipeline bug. Check row counts, null rates, and the schema of the source table for the same window.
Segment the drift by tenant, plan, or region. SaaS drift is usually concentrated — a single large new customer onboarding can shift seats and monthly_active_users overnight.
Measure precision and recall on whatever labeled data you do have from the drifted window, even if it's small.
Decide: retrain on a window including the new distribution, add a feature transformation that absorbs the shift (log-scale, capping), or accept and document if business impact is negligible.
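
The first item in the playbook, ruling out a pipeline bug, can be partly automated. The sketch below compares row counts and per-column null rates between the two windows. `pipeline_health` is a hypothetical helper and the toy frames are made up for illustration.

```python
import pandas as pd

def pipeline_health(reference: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Compare null rates per shared column between the two windows."""
    shared = [c for c in reference.columns if c in current.columns]
    out = pd.DataFrame({
        "ref_null_rate": reference[shared].isna().mean(),
        "cur_null_rate": current[shared].isna().mean(),
    })
    out["null_rate_change"] = out["cur_null_rate"] - out["ref_null_rate"]
    return out.sort_values("null_rate_change", ascending=False)

# Toy data: "seats" suddenly arrives half empty in the current window
ref = pd.DataFrame({"seats": [5, 10, 20, 40], "account_age_days": [30, 60, 90, 120]})
cur = pd.DataFrame({"seats": [5, None, 20, None], "account_age_days": [31, 61, 91, 121]})

print(len(ref), len(cur))  # row counts for the two windows
health = pipeline_health(ref, cur)
print(health)
```

A jump in null rate on a drifted feature is a strong hint that the shift comes from the source table, not from your customers.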

Common errors and how to fix them

PSI returns `inf` or `nan`

A bin or category has zero mass on one side, typically a category that exists in the current window but not in the reference, and the zero was not replaced before the log. Confirm the eps = 1e-6 guard is applied to both distributions before the log.

Every feature shows PSI > 0.25 on day one

Your reference and current windows aren't comparable. Common causes: reference includes a one-time backfill, current window pulls from a different table with different preprocessing, or your training pipeline applied transformations (scaling, encoding) that your scoring snapshot didn't. Compare raw, pre-transformation features on both sides.

Numeric feature returns PSI = 0.0

The quantile call collapsed to fewer than three unique edges — usually a feature that is mostly zero or constant. Either drop it from monitoring or switch to a categorical treatment (zero vs non-zero).
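
A zero vs non-zero treatment can be as small as the sketch below. The helper name and labels are hypothetical. The result is an object-dtype Series, so a PSI function that branches on dtype will route it through its categorical path.

```python
import numpy as np
import pandas as pd

def to_zero_flag(values: pd.Series) -> pd.Series:
    """Map a numeric series to 'zero' / 'nonzero' labels, keeping NaN as missing."""
    labels = pd.Series(
        np.where(values == 0, "zero", "nonzero"),
        index=values.index,
        dtype=object,
    )
    # Restore missing values so they are not counted as 'nonzero'
    return labels.where(values.notna())

sparse = pd.Series([0, 0, 0, 7, None, 0, 3])
flags = to_zero_flag(sparse)
print(flags.value_counts().to_dict())
```

Apply the same mapping to both the reference and current columns before scoring them.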

Categorical PSI spikes on a new category that's 0.1% of traffic

A category with near-zero share on one side contributes a large log ratio, so rare categories can dominate PSI. Either bucket low-frequency categories into a single "other" bucket in both reference and current, or filter the report to features where the dominant categories shifted.
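
The bucketing option could be sketched as follows. The `bucket_rare` name and the `min_share` threshold are assumptions for illustration. The important detail is that the set of kept categories is decided from the reference only and then applied to both sides.

```python
import pandas as pd

def bucket_rare(reference: pd.Series, current: pd.Series, min_share: float = 0.01):
    """Collapse categories below min_share of the reference into 'other'.

    Categories unseen in the reference also land in 'other'.
    """
    shares = reference.value_counts(normalize=True)
    keep = set(shares[shares >= min_share].index)

    def collapse(values: pd.Series) -> pd.Series:
        # Leave missing values untouched; relabel everything not in `keep`
        return values.where(values.isin(keep) | values.isna(), "other")

    return collapse(reference), collapse(current)

ref = pd.Series(["pro"] * 60 + ["free"] * 39 + ["legacy"] * 1)
cur = pd.Series(["pro"] * 55 + ["free"] * 44 + ["trial"] * 1)

ref_b, cur_b = bucket_rare(ref, cur, min_share=0.05)
print(sorted(ref_b.unique()), sorted(cur_b.unique()))
```

Both "legacy" and the previously unseen "trial" end up in the same bucket, so a 1% category no longer gets its own log term.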

Alerts fire but model accuracy is fine

Expected. PSI measures input drift, not performance drift. Use Step 5's importance weighting to suppress alerts on low-impact features, and pair PSI with a delayed-label performance check (weekly precision/recall on labeled samples) for a complete picture.

What this setup doesn't catch

Concept drift — when the relationship between inputs and the target changes but the input distribution stays the same — is invisible to PSI. For that you need labels and a rolling performance metric. PSI also won't help with feature interactions; two features can each look stable while their joint distribution shifts. If your model leans heavily on interactions, add a domain classifier (train a binary classifier to distinguish reference from current; if it scores well above 0.5 AUC, the joint distribution has shifted) as a second layer.
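
A domain classifier along those lines can be sketched with scikit-learn. Everything here is an assumption for illustration: the random forest, its hyperparameters, the synthetic data, and the `domain_auc` name. The synthetic "joint" window keeps each marginal at N(0, 1) but makes the two features strongly correlated, which per-feature PSI would not see.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def domain_auc(reference: pd.DataFrame, current: pd.DataFrame) -> float:
    """Cross-validated AUC of a classifier separating reference from current."""
    X = pd.concat([reference, current], ignore_index=True)
    y = np.r_[np.zeros(len(reference)), np.ones(len(current))]
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
    return float(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())

rng = np.random.default_rng(0)
n, rho = 1000, 0.95

ref = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
same = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})

# Marginals stay N(0, 1), but x2 now tracks x1 closely
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
joint = pd.DataFrame({"x1": x1, "x2": x2})

auc_same = domain_auc(ref, same)
auc_joint = domain_auc(ref, joint)
print(round(auc_same, 2), round(auc_joint, 2))
```

An AUC near 0.5 means the classifier cannot tell the windows apart. A clearly higher value on the correlated window indicates a joint shift even though each feature looks stable on its own.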

This is the floor, not the ceiling. Teams that mature past this typically graduate to Evidently, NannyML, or WhyLabs for richer reports and UI. But the floor — per-feature PSI, importance-weighted, on a weekly cron — is enough to stop learning about model degradation from a support ticket. For teams building AI-powered products where the model itself is the product, treating monitoring as a Day 1 concern rather than a Day 90 retrofit is the difference between predictable iteration and reactive firefighting.

Frequently Asked Questions

How often should I rebuild the reference distribution?

Rebuild when you retrain the model, not on a calendar. The reference exists to represent the data the current model was fit on; refreshing it more often hides the very drift you're trying to detect. If you retrain quarterly, your reference snapshots roll quarterly.

Is PSI better than the Kolmogorov-Smirnov test for drift?

For monitoring, yes. PSI gives a single non-negative number with widely used interpretation bands, while the KS test gives a p-value that flags even trivially small differences as significant once samples get large. KS is better when you have a one-time question about whether two specific samples differ. PSI is better when you're scoring drift on a schedule.

Can I use this for regression models, not just classifiers?

Yes — PSI is computed on input features and is model-agnostic. The Step 5 importance weighting works identically for scikit-learn regressors. The only adjustment is that your delayed-label performance check uses MAE or RMSE instead of precision and recall.

What's the minimum window size for the current period?

Roughly 1,000 rows for stable PSI estimates on numeric features with 10 bins. Below that, bin counts get noisy and PSI swings. If your traffic is too low for weekly windows of that size, extend to bi-weekly rather than reducing bins below five.
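
The effect of window size can be checked with a small simulation. The sketch below draws current windows from the same distribution as the reference, so any PSI it reports is sampling noise. The compact `psi_numeric` function, the normal data, and the window sizes are assumptions for illustration.

```python
import numpy as np

def psi_numeric(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Compact numeric-only PSI with reference quantile bins (illustrative)."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, bins=edges)[0] / reference.size
    cur_pct = np.histogram(current, bins=edges)[0] / current.size
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(size=50_000)

# Windows come from the SAME distribution, so any PSI here is pure sampling noise
noise = {}
for n in (100, 1_000, 5_000):
    scores = [psi_numeric(reference, rng.normal(size=n)) for _ in range(50)]
    noise[n] = float(np.mean(scores))
    print(n, round(noise[n], 4))
```

The average noise floor should shrink as the window grows. If the floor for your window size is already close to 0.1, widen the window before trusting the bands.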

How do I integrate this with an existing ML monitoring stack?

The JSON report from Step 4 is the integration point. Push it to Datadog as custom metrics, post the alert summary to Slack via webhook, or load it into your warehouse for trend dashboards. If you want help wiring drift detection into a broader production ML setup, contact CodeNicely for a personalized assessment of your current architecture.
