ML Model Versioning Cheatsheet: Weights, Data, and Code
For: A mid-stage ML engineer at a Series A SaaS company who just got burned by a production rollback — they reverted the model weights but forgot the preprocessing code had also changed, and now they cannot reproduce the version that was actually working
If you've ever rolled back model weights and watched production keep misbehaving, you already know the punchline: the weights weren't the only thing that changed. The preprocessing diverged. The threshold drifted. The training data slice is unrecoverable because someone reran the ETL. This cheatsheet is the reference I wish I'd had after the first time I got burned.
The core idea (read this first)
A model version is not a file. It is a tuple:
model_version = (weights, data_slice, preprocessing_code, hyperparameters, eval_threshold)
Any versioning scheme that does not pin all five simultaneously cannot guarantee reproducibility or safe rollback. Most teams version one or two and assume the rest are stable. They are not.
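Concretely, a minimal sketch of that tuple as a record you can write next to the artifact (field names and values here are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)  # frozen: the record never changes after registration
class ModelVersion:
    weights_uri: str        # artifact location, ideally with an object version ID
    data_slice: str         # DVC hash, Delta version, or versioned S3 path
    preprocessing_sha: str  # git SHA of the code that built the features
    hyperparameters: dict   # full training config, including the random seed
    eval_threshold: float   # the decision cutoff this model was evaluated at

version = ModelVersion(
    weights_uri="s3://models/fraud-detector/weights.onnx?versionId=abc123",
    data_slice="dvc:md5:3f9c1a...",
    preprocessing_sha="a8f3b2c",
    hyperparameters={"lr": 3e-4, "seed": 42, "n_estimators": 600},
    eval_threshold=0.62,
)

# Persist the whole tuple next to the weights, not just the weights.
print(json.dumps(asdict(version), indent=2))
```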
The five things you must version together
| Component | What it is | Common tool | Failure mode if unversioned |
|---|---|---|---|
| Weights | The serialized model artifact (.pt, .pkl, .onnx) | MLflow, W&B Artifacts, S3 + hash | Can't reload the exact trained model |
| Data slice | Immutable snapshot of training + eval data | DVC, LakeFS, Delta Lake time travel | Retraining produces a different model; can't audit what it learned from |
| Preprocessing code | Feature engineering, tokenizers, scalers, encoders | Git SHA pinned to the model record | Inference receives features in a different distribution than training |
| Hyperparameters | Learning rate, seed, architecture config, train/val split | Hydra configs, MLflow params | Can't reproduce training; can't explain behavior |
| Eval threshold | Decision cutoffs, calibration, business rules | Stored as model metadata, not in app code | Rolling back weights but keeping new threshold = silent regression |
ML experiment tracking vs model versioning — they are not the same
| | Experiment tracking | Model versioning |
|---|---|---|
| Purpose | Compare runs during development | Reproduce and roll back production models |
| Granularity | Every run, even failed ones | Only promoted candidates |
| Lifetime | Days to weeks | As long as the model is deployable |
| Required immutability | Low | Absolute |
| Typical tools | MLflow Tracking, W&B, Neptune | MLflow Model Registry, SageMaker Model Registry, Vertex AI Model Registry |
Experiment trackers will happily let you mutate run metadata. A registry should not. If your registry lets you overwrite a version in place, it's a tracker wearing a costume.
Versioning schemes: pick one and enforce it
| Scheme | Example | When to use | Bad at |
|---|---|---|---|
| Semantic (MAJOR.MINOR.PATCH) | fraud-detector-2.3.1 | Customer-facing models with API contracts | Encoding which dimension changed is judgment-heavy |
| Content hash | sha256:a8f3... | Pipelines where reproducibility > readability | Humans can't remember or compare them |
| Timestamp + git SHA | 2024-03-14T09:22_a8f3b2c | Small teams, fast iteration | No notion of "compatible" vs "breaking" |
| Monotonic integer + tag | v47 (production) | Registry-native workflows | Loses meaning across environments |
Practical default: monotonic integer in the registry as the canonical ID, plus a semver tag for external consumers, plus the git SHA of the training repo embedded in the artifact metadata.
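A small sketch of how the three identifiers can travel together, assuming the training code runs inside a git checkout (names and values are placeholders):

```python
import subprocess

def training_repo_sha() -> str:
    """Commit of the training repo; embed this in the artifact metadata."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

registry_version = 47                # canonical ID, assigned by the registry
semver_tag = "fraud-detector-2.3.1"  # contract for external consumers
metadata = {
    "git_sha": training_repo_sha(),  # provenance of the training code
    "semver": semver_tag,
}
print(f"registered v{registry_version} ({semver_tag}) from {metadata['git_sha']}")
```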
The minimum metadata to attach to every registered model
- Training data pointer — DVC hash, Delta version, or S3 path with object version ID. Not just "s3://bucket/data/".
- Preprocessing git SHA — pinned to the same repo and commit that produced the artifact.
- Feature schema — column names, dtypes, expected ranges. Serialized, not assumed.
- Hyperparameters — full config, including random seed.
- Eval metrics — on a frozen eval set, with the threshold used.
- Runtime requirements — Python version, framework version, CUDA version, requirements.txt hash.
- Promoter — who promoted it and when.
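One way to attach all of the above with MLflow, sketched under the assumption that a model has already been logged under the artifact path "model" in the same run (tag names, paths, and values are illustrative):

```python
import mlflow

with mlflow.start_run() as run:
    # Hyperparameters: the full config, including the seed.
    mlflow.log_params({"lr": 3e-4, "seed": 42, "n_estimators": 600})

    # Eval metrics on the frozen eval set, recorded with the threshold used.
    mlflow.log_metrics({"auc": 0.912, "precision_at_threshold": 0.87})

    # Pointers and provenance travel as tags so they stay attached to the run.
    mlflow.set_tags({
        "data_slice": "dvc:md5:3f9c1a...",   # immutable data pointer
        "preprocessing_sha": "a8f3b2c",      # git SHA of the feature code
        "eval_threshold": "0.62",
        "promoted_by": "jane@example.com",
    })

    # Feature schema: serialized, not assumed.
    mlflow.log_dict(
        {"columns": {"amount": "float64", "country": "category"}},
        "feature_schema.json",
    )
    mlflow.log_artifact("requirements.txt")  # runtime pins (file must exist locally)

# Registering assigns the monotonic version number in the registry.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "fraud-detector")
```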
How to version ML models so rollback actually works
- Build a model bundle, not a model file. The deployable artifact is a directory: weights + preprocessing code + schema + threshold + requirements. Sign it. Treat it as immutable.
- Pin preprocessing in the bundle, not in the serving repo. If preprocessing lives in your API service, rolling back the model doesn't roll back preprocessing. This is the bug that started this post.
- Store eval thresholds as model metadata. Not as a constant in the inference service. The model owns its decision boundary.
- Make data snapshots immutable. DVC, LakeFS, or Delta Lake time-travel. If your training script can mutate the source data, you don't have a snapshot — you have a wish.
- Use the registry's stage transitions as your rollback unit. "Production" is a pointer to a version. Rollback = move the pointer. No file copying, no manual S3 commands.
- Smoke-test on promotion. Before flipping the pointer, run inference on a fixed holdout and compare outputs to the recorded eval metrics. If they don't match, your bundle is incomplete.
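A sketch of the bundle idea, assuming the bundle is assembled as a local directory before upload (the layout, file names, and manifest format are illustrative):

```python
# fraud-detector-v47/
# ├── weights.onnx
# ├── preprocessing/       # the exact feature code, vendored into the bundle
# ├── feature_schema.json
# ├── threshold.json       # {"decision_threshold": 0.62}
# ├── requirements.txt
# └── MANIFEST.json        # hashes of every file above, plus the git SHA

import hashlib
import json
from pathlib import Path

def build_manifest(bundle_dir: str, git_sha: str) -> None:
    """Hash every file in the bundle so any later mutation is detectable."""
    bundle = Path(bundle_dir)
    hashes = {
        str(p.relative_to(bundle)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(bundle.rglob("*"))
        if p.is_file() and p.name != "MANIFEST.json"
    }
    manifest = {"git_sha": git_sha, "files": hashes}
    (bundle / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
```

Signing the manifest (or the whole tarball) is what turns "treat it as immutable" from a convention into something you can verify at load time.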
Model registry best practices (short list)
- One model = one registry entry. Variants (quantized, distilled) are separate entries, not silent overwrites.
- Stages: `None → Staging → Production → Archived`. Promotions require a recorded approver.
- Never delete an archived version for at least one full incident-response window. You will want it back.
- Tag the previous production version as `last-known-good` on every promotion. Rollback becomes one command.
- Block promotion if any of the five tuple components is missing from metadata. Enforce in CI, not in policy docs.
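A sketch of the last two practices with the MLflow client, assuming MLflow 2.x aliases and assuming the tuple components were copied onto the registered model version as tags at registration time:

```python
from mlflow.tracking import MlflowClient

# The five tuple components, as registry tags. Names are illustrative.
REQUIRED = {"data_slice", "preprocessing_sha", "eval_threshold",
            "hyperparameters_uri", "weights_hash"}

client = MlflowClient()

def promote(name: str, new_version: str, current_prod: str) -> None:
    # Block promotion if any tuple component is missing. Run this in CI.
    tags = client.get_model_version(name, new_version).tags
    missing = REQUIRED - tags.keys()
    if missing:
        raise RuntimeError(f"refusing to promote {name} v{new_version}: missing {missing}")

    # Keep a one-command rollback target before moving the pointer.
    client.set_registered_model_alias(name, "last-known-good", current_prod)
    client.set_registered_model_alias(name, "production", new_version)
```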
Rollback checklist (tape this to your monitor)
- Identify the `last-known-good` version in the registry.
- Verify its bundle contains weights, preprocessing code, schema, threshold.
- Re-run smoke tests on the bundle against a fixed eval slice.
- Flip the production pointer. Do not copy files.
- Confirm the serving container picked up the new bundle (check the version it logs on startup).
- Watch prediction distribution and latency for at least one full traffic cycle.
- Write down what made the previous version unrollable. Fix that before the next release.
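Steps 1 and 4 collapse to a short script if your registry supports aliases; here is a sketch with the MLflow client (the model name and alias names are assumptions carried over from above):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
name = "fraud-detector"

# Step 1: resolve the rollback target.
lkg = client.get_model_version_by_alias(name, "last-known-good")

# Step 4: flip the production pointer. No file copying, no manual S3 commands.
client.set_registered_model_alias(name, "production", lkg.version)

# Serving containers that resolve "models:/fraud-detector@production" pick up
# the rolled-back bundle on their next reload; verify the version they log.
print(f"production now points at version {lkg.version}")
```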
What this approach is bad at
- Storage cost. Immutable data snapshots and full model bundles eat space. Budget for it or set explicit retention.
- Speed of iteration. Enforced metadata on every promotion slows down hacky experimentation. That's the point, but it's friction.
- Cross-team coordination. If preprocessing is owned by the data team and serving by platform, you need an agreement on who owns the bundle. This is organizational, not technical.
- Large-model edge cases. For 70B+ LLMs, bundling weights with every version is impractical. Use content-addressed weight stores and version the adapter/config layer instead.
How CodeNicely can help
We've built and rebuilt this exact plumbing for production ML systems. On Cashpo — an AI-driven credit scoring and KYC stack — model rollback safety was non-negotiable because a bad threshold change directly affects loan decisions. We versioned scoring models as bundles tying weights, feature pipelines, and decision thresholds to a single registry record, so reverting a model also reverted the decision boundary it shipped with. If you're at the point where a partial rollback has already burned you, that's the engagement that maps closest. Our AI Studio team can audit your current registry setup and identify which of the five tuple components are unpinned.
Frequently Asked Questions
What's the difference between ML experiment tracking and model versioning?
Experiment tracking records every training run for comparison during development — it prioritizes breadth over immutability. Model versioning records only promoted candidates and treats them as immutable, audit-grade artifacts. MLflow does both, but the Tracking and Model Registry components serve different purposes and should be governed differently.
Do I need DVC if I'm already using MLflow?
Probably yes. MLflow versions models and parameters well, but its data versioning is shallow — it stores pointers, not snapshots. DVC, LakeFS, or Delta Lake time-travel gives you immutable data slices. Use MLflow for the model registry and DVC (or equivalent) for data, and store the data pointer in the MLflow model metadata.
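A sketch of wiring the two together, assuming your training data is tracked by a .dvc file whose path matches your repo layout:

```python
import mlflow
import yaml

# A .dvc file is plain YAML; DVC records a content hash for the tracked data.
with open("data/train.parquet.dvc") as f:
    dvc_meta = yaml.safe_load(f)
data_hash = dvc_meta["outs"][0]["md5"]

with mlflow.start_run():
    # Store the immutable data pointer with the model, not beside it.
    mlflow.set_tag("data_slice", f"dvc:md5:{data_hash}")
```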
How do I version preprocessing code so it stays in sync with the model?
Ship preprocessing inside the model bundle, not in your serving repository. Either serialize the transformer pipeline (sklearn Pipeline, TF Transform, PyTorch nn.Module) alongside the weights, or pin the exact git SHA of the preprocessing repo in the model metadata and have the serving container check it out at startup. Never let preprocessing live independently in your API service.
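A sketch of the "serialize the pipeline" option with scikit-learn and MLflow (the training call is elided; the tag name is an assumption):

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing and model travel as ONE object, so they can never drift apart.
pipeline = Pipeline([
    ("scaler", StandardScaler()),   # the preprocessing step
    ("clf", LogisticRegression()),  # the model itself
])
# pipeline.fit(X_train, y_train)    # training elided in this sketch

with mlflow.start_run():
    mlflow.sklearn.log_model(pipeline, artifact_path="model")
    # Pin the commit of the feature repo for anything that can't be serialized
    # (custom tokenizers, SQL feature queries, ...).
    mlflow.set_tag("preprocessing_sha", "a8f3b2c")
```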
Should eval thresholds be in code or model metadata?
Metadata. If the threshold lives in your inference service, rolling back the model leaves the new threshold in place — a silent regression. Treat the threshold as part of the model: it's a learned decision boundary, not application logic.
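A sketch of what "the model owns its decision boundary" looks like at serving startup, assuming the threshold was stored as a model version tag at promotion time:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Load the threshold from the model's own metadata, never from app config,
# so rolling back the model also rolls back the decision boundary.
mv = client.get_model_version_by_alias("fraud-detector", "production")
threshold = float(mv.tags["eval_threshold"])  # tag name is illustrative

def decide(score: float) -> bool:
    return score >= threshold
```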
How long should I keep archived model versions?
At minimum, long enough to cover one full incident-response window plus any regulatory audit period that applies to your domain. For fintech and healthcare workloads we've worked on, that often means years, not weeks. For internal-only models, a few months is usually enough. Set the retention explicitly — don't let it be accidental.
How much does it cost to set up a proper model registry?
It depends heavily on your existing stack, data volume, and compliance requirements. Contact CodeNicely for a personalized assessment of your current setup and what it would take to get to safe-rollback parity.
Building something in AI/ML?
CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.
Talk to our team