Startups · AI/ML · May 12, 2026 · 7 min read

ML Model Versioning Cheatsheet: Weights, Data, and Code

For: A mid-stage ML engineer at a Series A SaaS company who just got burned by a production rollback — they reverted the model weights but forgot the preprocessing code had also changed, and now they cannot reproduce the version that was actually working

If you've ever rolled back model weights and watched production keep misbehaving, you already know the punchline: the weights weren't the only thing that changed. The preprocessing diverged. The threshold drifted. The training data slice is unrecoverable because someone reran the ETL. This cheatsheet is the reference I wish I'd had after the first time I got burned.

The core idea (read this first)

A model version is not a file. It is a tuple:

model_version = (weights, data_slice, preprocessing_code, hyperparameters, eval_threshold)

Any versioning scheme that does not pin all five simultaneously cannot guarantee reproducibility or safe rollback. Most teams version one or two and assume the rest are stable. They are not.
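As a concrete sketch (the field names are illustrative, not a standard), the tuple can be made a frozen dataclass whose content hash becomes the version identity:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ModelVersion:
    """All five components pinned together. Field names are hypothetical."""
    weights_sha256: str         # hash of the serialized artifact
    data_slice_uri: str         # immutable snapshot ref, e.g. a DVC/LakeFS pointer
    preprocessing_git_sha: str  # commit of the feature-engineering code
    hyperparameters: str        # canonical JSON of the training config
    eval_threshold: float       # decision cutoff shipped with the model

    def version_id(self) -> str:
        """Content hash over the whole tuple: change any field, change the ID."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]
```

Changing any single field, even just the threshold, yields a new `version_id`, which is exactly the property a safe rollback depends on.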

The five things you must version together

| Component | What it is | Common tool | Failure mode if unversioned |
|---|---|---|---|
| Weights | The serialized model artifact (`.pt`, `.pkl`, `.onnx`) | MLflow, W&B Artifacts, S3 + hash | Can't reload the exact trained model |
| Data slice | Immutable snapshot of training + eval data | DVC, LakeFS, Delta Lake time travel | Retraining produces a different model; can't audit what it learned from |
| Preprocessing code | Feature engineering, tokenizers, scalers, encoders | Git SHA pinned to the model record | Inference receives features in a different distribution than training |
| Hyperparameters | Learning rate, seed, architecture config, train/val split | Hydra configs, MLflow params | Can't reproduce training; can't explain behavior |
| Eval threshold | Decision cutoffs, calibration, business rules | Stored as model metadata, not in app code | Rolling back weights but keeping new threshold = silent regression |

ML experiment tracking vs model versioning — they are not the same

| | Experiment tracking | Model versioning |
|---|---|---|
| Purpose | Compare runs during development | Reproduce and roll back production models |
| Granularity | Every run, even failed ones | Only promoted candidates |
| Lifetime | Days to weeks | As long as the model is deployable |
| Required immutability | Low | Absolute |
| Typical tools | MLflow Tracking, W&B, Neptune | MLflow Model Registry, SageMaker Model Registry, Vertex AI Model Registry |

Experiment trackers will happily let you mutate run metadata. A registry should not. If your registry lets you overwrite a version in place, it's a tracker wearing a costume.

Versioning schemes: pick one and enforce it

| Scheme | Example | When to use | Bad at |
|---|---|---|---|
| Semantic (MAJOR.MINOR.PATCH) | `fraud-detector-2.3.1` | Customer-facing models with API contracts | Encoding which dimension changed is judgment-heavy |
| Content hash | `sha256:a8f3...` | Pipelines where reproducibility > readability | Humans can't remember or compare them |
| Timestamp + git SHA | `2024-03-14T09:22_a8f3b2c` | Small teams, fast iteration | No notion of "compatible" vs "breaking" |
| Monotonic integer + tag | `v47` (production) | Registry-native workflows | Loses meaning across environments |

Practical default: monotonic integer in the registry as the canonical ID, plus a semver tag for external consumers, plus the git SHA of the training repo embedded in the artifact metadata.
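A minimal sketch of that default. The registry call itself is left abstract; `current_git_sha` assumes `git` is on the PATH and the working directory is the training repo:

```python
import subprocess

def current_git_sha(repo_dir: str = ".") -> str:
    """Short commit SHA of the training repo; requires git on PATH."""
    out = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def build_version_record(registry_version: int, semver_tag: str, git_sha: str) -> dict:
    """One record holding all three IDs from the practical default."""
    return {
        "registry_version": registry_version,  # canonical ID, e.g. 47
        "semver": semver_tag,                  # external contract, e.g. "2.3.1"
        "training_git_sha": git_sha,           # embedded in artifact metadata
    }
```

The record travels with the artifact, so any consumer can answer "which version?" at whichever granularity it cares about.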

The minimum metadata to attach to every registered model
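One hedged sketch of such a record — the keys are illustrative, not a registry standard, and every value below is a made-up example — covering each component of the five-part tuple plus what the promotion smoke test needs:

```python
# Illustrative minimum metadata for a registered model version.
# Each key maps back to one component of the five-part tuple.
minimum_model_metadata = {
    "model_name": "fraud-detector",                      # hypothetical model
    "registry_version": 47,
    "weights_uri": "s3://models/fraud-detector/47/model.onnx",
    "weights_sha256": "<content hash of the artifact>",
    "data_slice": "<immutable snapshot ref, e.g. a DVC or LakeFS pointer>",
    "preprocessing_git_sha": "<commit of the feature-engineering code>",
    "hyperparameters": {"lr": 3e-4, "seed": 42},
    "eval_threshold": 0.72,
    "eval_metrics": {"auc": 0.91},   # recorded for the promotion smoke test
    "trained_at": "2026-04-30T12:00:00Z",
}
```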

How to version ML models so rollback actually works

  1. Build a model bundle, not a model file. The deployable artifact is a directory: weights + preprocessing code + schema + threshold + requirements. Sign it. Treat it as immutable.
  2. Pin preprocessing in the bundle, not in the serving repo. If preprocessing lives in your API service, rolling back the model doesn't roll back preprocessing. This is the bug that started this post.
  3. Store eval thresholds as model metadata. Not as a constant in the inference service. The model owns its decision boundary.
  4. Make data snapshots immutable. DVC, LakeFS, or Delta Lake time-travel. If your training script can mutate the source data, you don't have a snapshot — you have a wish.
  5. Use the registry's stage transitions as your rollback unit. "Production" is a pointer to a version. Rollback = move the pointer. No file copying, no manual S3 commands.
  6. Smoke-test on promotion. Before flipping the pointer, run inference on a fixed holdout and compare outputs to the recorded eval metrics. If they don't match, your bundle is incomplete.
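Steps 1–3 can be enforced mechanically with a loader that refuses incomplete bundles. The file names below are an assumed layout, not a standard:

```python
import json
from pathlib import Path

# Assumed bundle layout; adjust to your artifact format.
REQUIRED = ["weights.onnx", "preprocess.py", "schema.json",
            "metadata.json", "requirements.txt"]

def load_bundle(bundle_dir: str) -> dict:
    """Fail fast if any bundle component is missing, then read the
    threshold from metadata -- never from application code."""
    root = Path(bundle_dir)
    missing = [f for f in REQUIRED if not (root / f).exists()]
    if missing:
        raise FileNotFoundError(f"incomplete bundle, missing: {missing}")
    meta = json.loads((root / "metadata.json").read_text())
    return {"root": root, "threshold": meta["eval_threshold"], "meta": meta}
```

If a rollback target fails this loader, you've learned the bundle is incomplete before it reaches production, not after.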

Model registry best practices (short list)

  - One registry record per promoted candidate; never overwrite a version in place.
  - Treat "Staging" and "Production" as pointers to versions, not copies of files.
  - Attach the full tuple as metadata to every record: weights hash, data pointer, preprocessing SHA, hyperparameters, threshold.
  - Gate every stage transition on an automated smoke test against a fixed eval slice.
  - Set archive retention explicitly; don't let it be accidental.

Rollback checklist (tape this to your monitor)

  1. Identify the last-known-good version in the registry.
  2. Verify its bundle contains weights, preprocessing code, schema, threshold.
  3. Re-run smoke tests on the bundle against a fixed eval slice.
  4. Flip the production pointer. Do not copy files.
  5. Confirm the serving container picked up the new bundle (check the version it logs on startup).
  6. Watch prediction distribution and latency for at least one full traffic cycle.
  7. Write down what made the previous version unrollable. Fix that before the next release.
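Step 3's smoke test can be as simple as replaying a fixed holdout through the candidate and comparing each output to what was recorded at registration time. The tolerance is a judgment call; `1e-6` here is an assumption:

```python
def smoke_test(predict, holdout_inputs, recorded_outputs, tol=1e-6):
    """Replay a fixed holdout through the candidate bundle and compare
    each prediction to the value recorded at registration time."""
    for x, expected in zip(holdout_inputs, recorded_outputs):
        got = predict(x)
        if abs(got - expected) > tol:
            raise AssertionError(
                f"bundle output drifted on {x!r}: {got} != {expected}")
    return True
```

A drift here almost always means the bundle is incomplete — usually the preprocessing or threshold component, per the failure-mode table above.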

What this approach is bad at

How CodeNicely can help

We've built and rebuilt this exact plumbing for production ML systems. On Cashpo — an AI-driven credit scoring and KYC stack — model rollback safety was non-negotiable because a bad threshold change directly affects loan decisions. We versioned scoring models as bundles tying weights, feature pipelines, and decision thresholds to a single registry record, so reverting a model also reverted the decision boundary it shipped with. If you're at the point where a partial rollback has already burned you, that's the engagement that maps closest. Our AI Studio team can audit your current registry setup and identify which of the five tuple components are unpinned.

Frequently Asked Questions

What's the difference between ML experiment tracking and model versioning?

Experiment tracking records every training run for comparison during development — it prioritizes breadth over immutability. Model versioning records only promoted candidates and treats them as immutable, audit-grade artifacts. MLflow does both, but the Tracking and Model Registry components serve different purposes and should be governed differently.

Do I need DVC if I'm already using MLflow?

Probably yes. MLflow versions models and parameters well, but its data versioning is shallow — it stores pointers, not snapshots. DVC, LakeFS, or Delta Lake time-travel gives you immutable data slices. Use MLflow for the model registry and DVC (or equivalent) for data, and store the data pointer in the MLflow model metadata.

How do I version preprocessing code so it stays in sync with the model?

Ship preprocessing inside the model bundle, not in your serving repository. Either serialize the transformer pipeline (sklearn Pipeline, TF Transform, PyTorch nn.Module) alongside the weights, or pin the exact git SHA of the preprocessing repo in the model metadata and have the serving container check it out at startup. Never let preprocessing live independently in your API service.
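One lightweight variant of that startup check — a content hash of the preprocessing module instead of a git SHA, purely illustrative — looks like this:

```python
import hashlib
from pathlib import Path

def preprocessing_fingerprint(path: str) -> str:
    """Content hash of the preprocessing module actually on disk."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def assert_preprocessing_in_sync(pinned: str, path: str) -> None:
    """Startup guard: refuse to serve if the live preprocessing code
    differs from the fingerprint recorded in the model metadata."""
    live = preprocessing_fingerprint(path)
    if live != pinned:
        raise RuntimeError(
            f"preprocessing drift: {live[:12]} != {pinned[:12]}")
```

Run the guard once at container startup: the serving process dies loudly instead of quietly serving a model with mismatched features.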

Should eval thresholds be in code or model metadata?

Metadata. If the threshold lives in your inference service, rolling back the model leaves the new threshold in place — a silent regression. Treat the threshold as part of the model: it's a learned decision boundary, not application logic.

How long should I keep archived model versions?

At minimum, long enough to cover one full incident-response window plus any regulatory audit period that applies to your domain. For fintech and healthcare workloads we've worked on, that often means years, not weeks. For internal-only models, a few months is usually enough. Set the retention explicitly — don't let it be accidental.

How much does it cost to set up a proper model registry?

It depends heavily on your existing stack, data volume, and compliance requirements. Contact CodeNicely for a personalized assessment of your current setup and what it would take to get to safe-rollback parity.

Building something in AI/ML?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team