ML Model Versioning Cheatsheet: Weights, Data, and Code
For: A mid-stage ML engineer at a Series A SaaS company who just got burned by a production rollback — they reverted the model weights but forgot the preprocessing code had also changed, and now they cannot reproduce the version that was actually working
If you've ever rolled back model weights and watched production keep misbehaving, you already know the punchline: the weights weren't the only thing that changed. The preprocessing diverged. The threshold drifted. The training data slice is unrecoverable because someone reran the ETL. This cheatsheet is the reference I wish I'd had after the first time I got burned.
The core idea (read this first)
A model version is not a file. It is a tuple:
model_version = (weights, data_slice, preprocessing_code, hyperparameters, eval_threshold)
Any versioning scheme that does not pin all five simultaneously cannot guarantee reproducibility or safe rollback. Most teams version one or two and assume the rest are stable. They are not.
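Concretely, a minimal sketch of that tuple as a record you can write next to the artifact (field names and values here are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)  # frozen: the record never changes after registration
class ModelVersion:
    weights_uri: str        # artifact location, ideally with an object version ID
    data_slice: str         # DVC hash, Delta version, or versioned S3 path
    preprocessing_sha: str  # git SHA of the code that built the features
    hyperparameters: dict   # full training config, including the random seed
    eval_threshold: float   # the decision cutoff this model was evaluated at

version = ModelVersion(
    weights_uri="s3://models/fraud-detector/weights.onnx?versionId=abc123",
    data_slice="dvc:md5:3f9c1a...",
    preprocessing_sha="a8f3b2c",
    hyperparameters={"lr": 3e-4, "seed": 42, "n_estimators": 600},
    eval_threshold=0.62,
)

# Persist the whole tuple next to the weights, not just the weights.
print(json.dumps(asdict(version), indent=2))
```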
The five things you must version together
| Component | What it is | Common tool | Failure mode if unversioned |
|---|---|---|---|
| Weights | The serialized model artifact (.pt, .pkl, .onnx) | MLflow, W&B Artifacts, S3 + hash | Can't reload the exact trained model |
| Data slice | Immutable snapshot of training + eval data | DVC, LakeFS, Delta Lake time travel | Retraining produces a different model; can't audit what it learned from |
| Preprocessing code | Feature engineering, tokenizers, scalers, encoders | Git SHA pinned to the model record | Inference receives features in a different distribution than training |
| Hyperparameters | Learning rate, seed, architecture config, train/val split | Hydra configs, MLflow params | Can't reproduce training; can't explain behavior |
| Eval threshold | Decision cutoffs, calibration, business rules | Stored as model metadata, not in app code | Rolling back weights but keeping new threshold = silent regression |
ML experiment tracking vs model versioning — they are not the same
| | Experiment tracking | Model versioning |
|---|---|---|
| Purpose | Compare runs during development | Reproduce and roll back production models |
| Granularity | Every run, even failed ones | Only promoted candidates |
| Lifetime | Days to weeks | As long as the model is deployable |
| Required immutability | Low | Absolute |
| Typical tools | MLflow Tracking, W&B, Neptune | MLflow Model Registry, SageMaker Model Registry, Vertex AI Model Registry |
Experiment trackers will happily let you mutate run metadata. A registry should not. If your registry lets you overwrite a version in place, it's a tracker wearing a costume.
Versioning schemes: pick one and enforce it
| Scheme | Example | When to use | Bad at |
|---|---|---|---|
| Semantic (MAJOR.MINOR.PATCH) | fraud-detector-2.3.1 | Customer-facing models with API contracts | Encoding which dimension changed is judgment-heavy |
| Content hash | sha256:a8f3... | Pipelines where reproducibility > readability | Humans can't remember or compare them |
| Timestamp + git SHA | 2024-03-14T09:22_a8f3b2c | Small teams, fast iteration | No notion of "compatible" vs "breaking" |
| Monotonic integer + tag | v47 (production) | Registry-native workflows | Loses meaning across environments |
Practical default: monotonic integer in the registry as the canonical ID, plus a semver tag for external consumers, plus the git SHA of the training repo embedded in the artifact metadata.
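A small sketch of how the three identifiers can travel together, assuming the training code runs inside a git checkout (names and values are placeholders):

```python
import subprocess

def training_repo_sha() -> str:
    """Commit of the training repo; embed this in the artifact metadata."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

registry_version = 47                # canonical ID, assigned by the registry
semver_tag = "fraud-detector-2.3.1"  # contract for external consumers
metadata = {
    "git_sha": training_repo_sha(),  # provenance of the training code
    "semver": semver_tag,
}
print(f"registered v{registry_version} ({semver_tag}) from {metadata['git_sha']}")
```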
The minimum metadata to attach to every registered model
- Training data pointer — DVC hash, Delta version, or S3 path with object version ID. Not just "s3://bucket/data/".
- Preprocessing git SHA — pinned to the same repo and commit that produced the artifact.
- Feature schema — column names, dtypes, expected ranges. Serialized, not assumed.
- Hyperparameters — full config, including random seed.
- Eval metrics — on a frozen eval set, with the threshold used.
- Runtime requirements — Python version, framework version, CUDA version, requirements.txt hash.
- Promoter — who promoted it and when.
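One way to attach all of the above with MLflow, sketched under the assumption that a model has already been logged under the artifact path "model" in the same run (tag names, paths, and values are illustrative):

```python
import mlflow

with mlflow.start_run() as run:
    # Hyperparameters: the full config, including the seed.
    mlflow.log_params({"lr": 3e-4, "seed": 42, "n_estimators": 600})

    # Eval metrics on the frozen eval set, recorded with the threshold used.
    mlflow.log_metrics({"auc": 0.912, "precision_at_threshold": 0.87})

    # Pointers and provenance travel as tags so they stay attached to the run.
    mlflow.set_tags({
        "data_slice": "dvc:md5:3f9c1a...",   # immutable data pointer
        "preprocessing_sha": "a8f3b2c",      # git SHA of the feature code
        "eval_threshold": "0.62",
        "promoted_by": "jane@example.com",
    })

    # Feature schema: serialized, not assumed.
    mlflow.log_dict(
        {"columns": {"amount": "float64", "country": "category"}},
        "feature_schema.json",
    )
    mlflow.log_artifact("requirements.txt")  # runtime pins (file must exist locally)

# Registering assigns the monotonic version number in the registry.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "fraud-detector")
```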
How to version ML models so rollback actually works
- Build a model bundle, not a model file. The deployable artifact is a directory: weights + preprocessing code + schema + threshold + requirements. Sign it. Treat it as immutable.
- Pin preprocessing in the bundle, not in the serving repo. If preprocessing lives in your API service, rolling back the model doesn't roll back preprocessing. This is the bug that started this post.
- Store eval thresholds as model metadata. Not as a constant in the inference service. The model owns its decision boundary.
- Make data snapshots immutable. DVC, LakeFS, or Delta Lake time-travel. If your training script can mutate the source data, you don't have a snapshot — you have a wish.
- Use the registry's stage transitions as your rollback unit. "Production" is a pointer to a version. Rollback = move the pointer. No file copying, no manual S3 commands.
- Smoke-test on promotion. Before flipping the pointer, run inference on a fixed holdout and compare outputs to the recorded eval metrics. If they don't match, your bundle is incomplete.
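A sketch of the bundle idea, assuming the bundle is assembled as a local directory before upload (the layout, file names, and manifest format are illustrative):

```python
# fraud-detector-v47/
# ├── weights.onnx
# ├── preprocessing/       # the exact feature code, vendored into the bundle
# ├── feature_schema.json
# ├── threshold.json       # {"decision_threshold": 0.62}
# ├── requirements.txt
# └── MANIFEST.json        # hashes of every file above, plus the git SHA

import hashlib
import json
from pathlib import Path

def build_manifest(bundle_dir: str, git_sha: str) -> None:
    """Hash every file in the bundle so any later mutation is detectable."""
    bundle = Path(bundle_dir)
    hashes = {
        str(p.relative_to(bundle)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(bundle.rglob("*"))
        if p.is_file() and p.name != "MANIFEST.json"
    }
    manifest = {"git_sha": git_sha, "files": hashes}
    (bundle / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
```

Signing the manifest (or the whole tarball) is what turns "treat it as immutable" from a convention into something you can verify at load time.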
Model registry best practices (short list)
- One model = one registry entry. Variants (quantized, distilled) are separate entries, not silent overwrites.
- Stages: `None → Staging → Production → Archived`. Promotions require a recorded approver.
- Never delete an archived version for at least one full incident-response window. You will want it back.
- Tag the previous production version as `last-known-good` on every promotion. Rollback becomes one command.
- Block promotion if any of the five tuple components is missing from metadata. Enforce in CI, not in policy docs.
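A sketch of the last two practices with the MLflow client, assuming MLflow 2.x aliases and assuming the tuple components were copied onto the registered model version as tags at registration time:

```python
from mlflow.tracking import MlflowClient

# The five tuple components, as registry tags. Names are illustrative.
REQUIRED = {"data_slice", "preprocessing_sha", "eval_threshold",
            "hyperparameters_uri", "weights_hash"}

client = MlflowClient()

def promote(name: str, new_version: str, current_prod: str) -> None:
    # Block promotion if any tuple component is missing. Run this in CI.
    tags = client.get_model_version(name, new_version).tags
    missing = REQUIRED - tags.keys()
    if missing:
        raise RuntimeError(f"refusing to promote {name} v{new_version}: missing {missing}")

    # Keep a one-command rollback target before moving the pointer.
    client.set_registered_model_alias(name, "last-known-good", current_prod)
    client.set_registered_model_alias(name, "production", new_version)
```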
Rollback checklist (tape this to your monitor)
- Identify the `last-known-good` version in the registry.
- Verify its bundle contains weights, preprocessing code, schema, threshold.
- Re-run smoke tests on the bundle against a fixed eval slice.
- Flip the production pointer. Do not copy files.
- Confirm the serving container picked up the new bundle (check the version it logs on startup).
- Watch prediction distribution and latency for at least one full traffic cycle.
- Write down what made the previous version unrollable. Fix that before the next release.
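Steps 1 and 4 collapse to a short script if your registry supports aliases; here is a sketch with the MLflow client (the model name and alias names are assumptions carried over from above):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
name = "fraud-detector"

# Step 1: resolve the rollback target.
lkg = client.get_model_version_by_alias(name, "last-known-good")

# Step 4: flip the production pointer. No file copying, no manual S3 commands.
client.set_registered_model_alias(name, "production", lkg.version)

# Serving containers that resolve "models:/fraud-detector@production" pick up
# the rolled-back bundle on their next reload; verify the version they log.
print(f"production now points at version {lkg.version}")
```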
What this approach is bad at
- Storage cost. Immutable data snapshots and full model bundles eat space. Budget for it or set explicit retention.
- Speed of iteration. Enforced metadata on every promotion slows down hacky experimentation. That's the point, but it's friction.
- Cross-team coordination. If preprocessing is owned by the data team and serving by platform, you need an agreement on who owns the bundle. This is organizational, not technical.
- Large-model edge cases. For 70B+ LLMs, bundling weights with every version is impractical. Use content-addressed weight stores and version the adapter/config layer instead.
How CodeNicely can help
We've built and rebuilt this exact plumbing for production ML systems. On Cashpo — an AI-driven credit scoring and KYC stack — model rollback safety was non-negotiable because a bad threshold change directly affects loan decisions. We versioned scoring models as bundles tying weights, feature pipelines, and decision thresholds to a single registry record, so reverting a model also reverted the decision boundary it shipped with. If you're at the point where a partial rollback has already burned you, that's the engagement that maps closest. Our AI Studio team can audit your current registry setup and identify which of the five tuple components are unpinned.
Frequently Asked Questions
What's the difference between ML experiment tracking and model versioning?
Experiment tracking records every training run for comparison during development — it prioritizes breadth over immutability. Model versioning records only promoted candidates and treats them as immutable, audit-grade artifacts. MLflow does both, but the Tracking and Model Registry components serve different purposes and should be governed differently.
Do I need DVC if I'm already using MLflow?
Probably yes. MLflow versions models and parameters well, but its data versioning is shallow — it stores pointers, not snapshots. DVC, LakeFS, or Delta Lake time-travel gives you immutable data slices. Use MLflow for the model registry and DVC (or equivalent) for data, and store the data pointer in the MLflow model metadata.
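A sketch of wiring the two together, assuming your training data is tracked by a .dvc file whose path matches your repo layout:

```python
import mlflow
import yaml

# A .dvc file is plain YAML; DVC records a content hash for the tracked data.
with open("data/train.parquet.dvc") as f:
    dvc_meta = yaml.safe_load(f)
data_hash = dvc_meta["outs"][0]["md5"]

with mlflow.start_run():
    # Store the immutable data pointer with the model, not beside it.
    mlflow.set_tag("data_slice", f"dvc:md5:{data_hash}")
```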
How do I version preprocessing code so it stays in sync with the model?
Ship preprocessing inside the model bundle, not in your serving repository. Either serialize the transformer pipeline (sklearn Pipeline, TF Transform, PyTorch nn.Module) alongside the weights, or pin the exact git SHA of the preprocessing repo in the model metadata and have the serving container check it out at startup. Never let preprocessing live independently in your API service.
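A sketch of the "serialize the pipeline" option with scikit-learn and MLflow (the training call is elided; the tag name is an assumption):

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing and model travel as ONE object, so they can never drift apart.
pipeline = Pipeline([
    ("scaler", StandardScaler()),   # the preprocessing step
    ("clf", LogisticRegression()),  # the model itself
])
# pipeline.fit(X_train, y_train)    # training elided in this sketch

with mlflow.start_run():
    mlflow.sklearn.log_model(pipeline, artifact_path="model")
    # Pin the commit of the feature repo for anything that can't be serialized
    # (custom tokenizers, SQL feature queries, ...).
    mlflow.set_tag("preprocessing_sha", "a8f3b2c")
```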
Should eval thresholds be in code or model metadata?
Metadata. If the threshold lives in your inference service, rolling back the model leaves the new threshold in place — a silent regression. Treat the threshold as part of the model: it's a learned decision boundary, not application logic.
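A sketch of what "the model owns its decision boundary" looks like at serving startup, assuming the threshold was stored as a model version tag at promotion time:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Load the threshold from the model's own metadata, never from app config,
# so rolling back the model also rolls back the decision boundary.
mv = client.get_model_version_by_alias("fraud-detector", "production")
threshold = float(mv.tags["eval_threshold"])  # tag name is illustrative

def decide(score: float) -> bool:
    return score >= threshold
```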
How long should I keep archived model versions?
At minimum, long enough to cover one full incident-response window plus any regulatory audit period that applies to your domain. For fintech and healthcare workloads we've worked on, that often means years, not weeks. For internal-only models, a few months is usually enough. Set the retention explicitly — don't let it be accidental.
How much does it cost to set up a proper model registry?
It depends heavily on your existing stack, data volume, and compliance requirements. Contact CodeNicely for a personalized assessment of your current setup and what it would take to get to safe-rollback parity.
Building something in AI/ML?
CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.
Talk to our team