Startups • Healthcare • May 7, 2026 • 9 min read

5 Mistakes We Made Shipping AI to a Live Pharmacy Marketplace

For: A product lead at a Series A e-pharmacy or health commerce startup who has shipped an AI-powered substitution, recommendation, or order-routing feature and is now watching it behave correctly in staging while producing silent patient-facing errors in production — wrong substitutions, missed contraindications, inventory mismatches.

The model passed every internal eval. Substitution accuracy looked clean on the test set. The recommendation engine handled edge cases in staging. Then it shipped, and within seventy-two hours a customer service lead is forwarding screenshots: a diabetic patient offered a substitute that wasn't therapeutically equivalent, an order routed to a pharmacy that ran out of stock four minutes earlier, a contraindication the model confidently waved through because the SKU mapping changed last Tuesday.

Welcome to the gap between healthcare AI in a notebook and healthcare AI on a live marketplace. Below are five mistakes we keep seeing — across our own work on platforms like HealthPotli and audits we've done for other teams — that explain most of the silent failures product leads at Series A e-pharmacy startups are quietly firefighting right now.

Mistake 1: Treating the catalog as a stable input

The single most common cause of bad AI behavior in production e-pharmacy is not the model. It's the catalog the model is reasoning over.

Pharmacy catalogs are not e-commerce catalogs. SKUs get re-mapped when distributors change packaging. Strength and dosage fields get back-filled by ops teams in spreadsheets. Generic equivalents get linked or unlinked by content managers based on recall notices. The taxonomy you trained your substitution model on three months ago is already drifting, and nobody told the ML team because the catalog is owned by content ops.

Symptom in production: Substitutions that were valid in your evals start failing in narrow ways — wrong strength suggested, wrong route of administration, branded product offered as a generic equivalent when it isn't.

Why it happens: Your training set was a clean snapshot. Your serving layer reads from a live catalog. The schemas drift, the join keys change, and your model silently starts hallucinating mappings.

How to recover: Version the catalog. Treat it like a model artifact. Every catalog change — SKU additions, salt mappings, therapeutic-equivalence edits — should bump a hash that the inference pipeline reads at request time. If the hash doesn't match what the model was validated against, you log a warning and either fall back to a rules-based path or downgrade confidence. You also need a catalog QA gate that re-runs your substitution eval set against every meaningful catalog change, not just every model release.
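
Here's a minimal sketch of what that gate can look like at request time, assuming a Python serving layer. The field names hashed over, and the helpers `rules_based_substitute` and `model.predict`, are illustrative, not a reference to any particular framework:

```python
# Minimal sketch of a catalog-version gate at inference time.
# The hashed fields and helper names are assumptions; adapt to your catalog schema.
import hashlib
import json
import logging

logger = logging.getLogger("substitution")

def catalog_version_hash(rows: list[dict]) -> str:
    """Deterministic hash over the catalog fields the model actually reasons about."""
    canonical = json.dumps(
        [{k: r.get(k) for k in ("sku", "salt", "strength", "equivalence_group")}
         for r in sorted(rows, key=lambda r: r["sku"])],
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

def suggest_substitute(query, catalog_rows, model, rules_based_substitute,
                       validated_catalog_hash):
    """Serve the model only if the live catalog still matches what it was validated on."""
    current_hash = catalog_version_hash(catalog_rows)
    if current_hash != validated_catalog_hash:
        # Catalog drifted since the last eval run: warn and take the conservative path.
        logger.warning("catalog hash mismatch at inference: %s", current_hash[:12])
        return rules_based_substitute(query, catalog_rows)
    return model.predict(query, catalog_rows)
```

The same hash is what the catalog QA gate keys on: any change to it re-runs the substitution eval set before the new catalog version is allowed to serve traffic.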

Mistake 2: Inventory and inference clocked to different heartbeats

This is the failure mode that scares us most, and the one almost nobody catches in staging.

Your AI inference pipeline takes a query, fetches context, runs the model, returns a recommendation. The whole loop is maybe 400ms. Your inventory signal — what's actually on the shelf at the partner pharmacy — updates every 60 to 120 seconds, sometimes longer if the pharmacy is using a polled POS integration. Sometimes the pharmacist sells the last strip over the counter and the system catches up four minutes later.

So your model is making a confident, plausible recommendation against a snapshot of inventory that is already 90 seconds stale. In staging, with seeded data, it works perfectly. In production, you get order routing to pharmacies that can't fulfill, and substitution suggestions for items that are technically out of stock everywhere in the catchment.

Symptom in production: Cancellation or re-routing rate spikes one to two hours after launch. Customer ops sees "item unavailable" reasons after the order was already confirmed.

Why it happens: Inventory and AI are owned by different teams, with different SLAs, and almost never reconciled to a shared timestamp.

How to recover: Add a freshness budget to every AI decision that touches fulfillment. Every recommendation should carry the timestamp of the inventory data it was conditioned on. If that timestamp is older than your freshness budget — we typically argue for under 30 seconds for hot SKUs, longer for cold catalog — the recommendation either gets re-validated against a live inventory probe or gets demoted. Also: build a reconciliation job that compares what the model recommended against what was actually fulfillable, bucketed by freshness. The correlation between staleness and downstream failure will be obvious within a week of data, and it will tell you exactly where to tighten the budget.
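
A freshness budget is simple to express in code. The sketch below assumes the recommendation carries the timestamp of the inventory snapshot it was conditioned on; the specific thresholds and the `probe_live_inventory` callable are assumptions you'd swap for your own:

```python
# Sketch of a freshness budget on a fulfillment-touching recommendation.
# Thresholds and the probe_live_inventory() hook are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

HOT_SKU_BUDGET = timedelta(seconds=30)
COLD_SKU_BUDGET = timedelta(seconds=180)

@dataclass
class Recommendation:
    sku: str
    pharmacy_id: str
    confidence: float
    inventory_snapshot_at: datetime  # timestamp of the inventory the model saw

def within_budget(rec: Recommendation, is_hot_sku: bool) -> bool:
    budget = HOT_SKU_BUDGET if is_hot_sku else COLD_SKU_BUDGET
    return datetime.now(timezone.utc) - rec.inventory_snapshot_at <= budget

def gate_recommendation(rec: Recommendation, is_hot_sku: bool, probe_live_inventory):
    """Re-validate or demote a recommendation whose inventory snapshot is stale."""
    if within_budget(rec, is_hot_sku):
        return rec
    if probe_live_inventory(rec.pharmacy_id, rec.sku):
        return rec  # still fulfillable; the snapshot was stale but the shelf didn't change
    rec.confidence = 0.0  # demote: force re-routing or human review downstream
    return rec
```

The reconciliation job is the analytics mirror of this gate: join what the model recommended against what was actually fulfillable, bucketed by snapshot age, and the staleness-to-failure curve falls out directly.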

Mistake 3: Confidence thresholds tuned on clean data

Every team we've worked with that ships a substitution or recommendation model picks a confidence threshold during evaluation. "We'll auto-suggest above 0.82, escalate to pharmacist review below." The number sounds defensible because it was tuned against a labeled validation set.

The validation set was clean. Production isn't.

In production, your inputs are messier — abbreviated drug names typed on mobile, OCR'd prescriptions, partial dosage info, customers who entered the brand name wrong. The model still produces a confidence score, and that score still clears your threshold, but the underlying input quality is nothing like what you tuned against. So you get high-confidence wrong answers — exactly the worst class of failure for a healthcare product.

Symptom in production: Pharmacist override rate is low (because the model is "confident"), but customer complaint rate is rising. The model isn't asking for help when it should be.

Why it happens: Confidence calibration is a function of input distribution. When the input distribution shifts, calibration breaks before accuracy does.

How to recover: Re-calibrate against production data, not lab data. Sample real production inputs weekly, label them, and check whether your confidence buckets still mean what you think they mean. If 0.85+ confidence used to be 97% correct and is now 88% correct, that's a calibration problem, not an accuracy problem, and the fix is to raise the auto-suggest threshold and route more cases to human review until you understand the drift. A second, cheaper hedge: add an input-quality classifier that flags noisy or ambiguous queries and forces them down a more conservative path regardless of model confidence.
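
A weekly calibration check doesn't need much machinery. The sketch below assumes you've labeled a sample of production inputs into a pandas DataFrame with `confidence` and `correct` columns; the bucket boundaries are illustrative:

```python
# Rough weekly calibration report over labeled production samples.
# Column names and bucket edges are assumptions; the 0.82 edge mirrors the
# auto-suggest threshold discussed above.
import pandas as pd

def calibration_report(df: pd.DataFrame, bins=(0.5, 0.7, 0.82, 0.9, 1.0)) -> pd.DataFrame:
    """Accuracy per confidence bucket; compare against the numbers from your eval set."""
    df = df.copy()
    df["bucket"] = pd.cut(df["confidence"], bins=bins, include_lowest=True)
    return (
        df.groupby("bucket", observed=True)["correct"]
        .agg(n="count", accuracy="mean")
        .reset_index()
    )

# If the 0.82-0.90 bucket used to run ~97% correct and now runs ~88%, raise the
# auto-suggest threshold and route more cases to pharmacist review until you
# understand the drift.
```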

Mistake 4: No contraindication audit trail

If your AI checks for drug-drug interactions, allergies, or contraindications, you need to be able to answer this question on demand for any historical order: what facts did the model have, what did it conclude, and why?

Most teams can't. They log the input and the output. They don't log the intermediate state — what allergies were known, what active prescriptions were considered, which interaction rules fired, which were suppressed. So when something goes wrong and a regulator, a partner pharmacy, or a hospital legal team asks why the system cleared an order it shouldn't have, you're reverse-engineering from logs that weren't designed to answer the question.

Symptom in production: A real safety incident occurs and you spend three days reconstructing what the model knew at decision time. You find out the patient profile was updated four hours after the order, but you can't prove the model didn't see the update.

Why it happens: ML observability tooling is built for accuracy debugging, not for clinical accountability. The two are not the same.

How to recover: Treat every AI decision that touches patient safety as an immutable event. Log the input, the patient context snapshot (with timestamp), the rules that were evaluated, the rules that fired, the model output, the confidence, and the action taken — all in a single record, all hashed, all queryable by order ID. This is non-negotiable for any e-pharmacy operating in a regulated market. It also turns out to be the single most useful artifact you have when you're debugging silent failures, because it lets you reconstruct exactly the world the model believed it was operating in.
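
One way to shape that record, as a sketch: a frozen dataclass whose fields follow the list above, plus a content hash. The field names and whatever append-only store you write it to are up to you:

```python
# Sketch of an append-only decision record for safety-relevant AI decisions.
# Field names are illustrative; the point is one hashed, queryable record per
# decision, keyed by order ID.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DecisionRecord:
    order_id: str
    decided_at: str            # ISO timestamp of the decision
    model_input: dict          # raw query as received
    patient_context: dict      # allergies, active prescriptions, etc.
    patient_context_at: str    # timestamp of the profile snapshot the model saw
    catalog_hash: str
    inventory_snapshot_at: str
    rules_evaluated: list
    rules_fired: list
    model_output: dict
    confidence: float
    action_taken: str          # e.g. "auto_suggested", "escalated_to_pharmacist"

    def record_hash(self) -> str:
        """Content hash stored alongside the record so tampering or rewrites are detectable."""
        payload = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()
```

Write it synchronously, before the action is taken, to storage nobody can update in place. If the record doesn't exist, the decision didn't happen.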

Mistake 5: Shipping AI without a rules-based fallback

The teams who get into the worst trouble are the ones who replaced their rules engine with the model. The teams who recover fastest are the ones who ran both in parallel and let the rules engine act as a sanity check.

A model that suggests a substitute should never be the only thing standing between a customer and a wrong substitution. There should be a deterministic check downstream — same therapeutic class, same strength within tolerance, no contraindication against the patient's profile, no recall flag — that can veto the model. If the model proposes something the rules can't justify, the rules win and the case routes to a pharmacist.
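
As a sketch of what that veto can look like, assuming simple SKU metadata and a 10% strength tolerance (both assumptions, not clinical guidance):

```python
# Deterministic veto on a model-proposed substitution. Field names, the allergy
# check standing in for a full contraindication lookup, and the tolerance value
# are all illustrative assumptions.
def rules_veto(proposal: dict, original: dict, patient: dict,
               recall_flags: set, tolerance: float = 0.10):
    """Return a veto reason if the proposal fails a hard safety check, else None."""
    if proposal["therapeutic_class"] != original["therapeutic_class"]:
        return "therapeutic_class_mismatch"
    if abs(proposal["strength_mg"] - original["strength_mg"]) > tolerance * original["strength_mg"]:
        return "strength_out_of_tolerance"
    if proposal["salt"] in patient.get("allergies", []):
        return "contraindicated_for_patient"
    if proposal["sku"] in recall_flags:
        return "recall_flag"
    return None  # no veto: the model's suggestion is allowed through
```

Any non-None result routes the case to a pharmacist, with the reason attached to the decision record described under Mistake 4.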

Symptom in production: When the model misbehaves, it misbehaves all the way to the customer because there's no second line of defense. Rollbacks are scary because the model is now load-bearing for the whole funnel.

Why it happens: Teams over-trust evaluation metrics and under-invest in deterministic guardrails. Rules-based systems feel old-fashioned and get deprecated too aggressively.

How to recover: Re-introduce a rules layer in front of every AI decision that affects fulfillment or safety. Define the irreducible safety constraints — the things that must be true regardless of what the model thinks — and encode them as deterministic checks. Measure how often the rules veto the model. That number is one of the best leading indicators you have of model health: if it suddenly spikes, something has changed in either the model or the catalog and you want to know before customers do.
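
Tracking the veto rate is a one-counter job in whatever metrics stack you already run. The sketch below uses the prometheus_client library purely as one common choice; any backend works:

```python
# Minimal veto-rate counter, assuming the prometheus_client library is available.
from prometheus_client import Counter

substitution_decisions = Counter(
    "substitution_decisions_total",
    "AI substitution proposals, by outcome",
    ["outcome"],  # "accepted" or the specific veto reason
)

def record_outcome(veto_reason=None):
    substitution_decisions.labels(outcome=veto_reason or "accepted").inc()

# Alert when the non-"accepted" share climbs above its trailing baseline: a sudden
# spike usually means the model or the catalog changed, not the customers.
```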

The pattern underneath all five

Every one of these mistakes is a version of the same underlying problem: the AI was validated against a static, idealized world, and the production world is dynamic, messy, and clocked differently from the model.

You don't fix this with a better model. You fix it with infrastructure that respects the gap — versioned catalogs, freshness budgets, calibrated confidence, immutable decision logs, and deterministic guardrails. The model is one component in a system that has to behave correctly even when the model doesn't.

If your team is in the middle of this right now — passing internal QA, failing on live orders, unsure where the failure surface actually is — start with the audit trail. You can't fix what you can't see, and most e-pharmacy AI stacks we've reviewed are flying with one or two of these instruments missing. Build the logs first. The fixes get obvious after that.

Frequently Asked Questions

Why does our AI substitution model perform well in staging but fail in production?

Almost always because staging uses a clean catalog snapshot and seeded inventory, while production has a drifting catalog and stale inventory signals. The model isn't wrong — it's reasoning over inputs that don't match what it was validated against. Start by checking catalog version drift and inventory freshness at the moment of inference.

How often should we re-validate confidence thresholds for a healthcare recommendation system?

Weekly at minimum during the first three months after launch, then at least monthly once you have stable production traffic. Any time you see a meaningful shift in input distribution — a new partner pharmacy onboarded, a new prescription intake channel, a UI change to the search box — re-calibrate. Confidence calibration drifts before accuracy does, and it's the early warning signal you don't want to miss.

Should we replace our rules engine with an AI model for drug substitution?

No. Run them in parallel. The model proposes, the rules dispose. Deterministic safety checks — therapeutic equivalence, strength tolerance, contraindication, recall status — should always be able to veto a model recommendation. This also gives you a rollback path that doesn't require taking the AI offline.

What should an AI decision log contain for a regulated pharmacy product?

At minimum: order ID, timestamp, full input, patient context snapshot with its own timestamp, catalog version hash, inventory freshness, rules evaluated and their outcomes, model output, confidence score, and final action. All immutable, all queryable, all retained per your jurisdiction's record-keeping requirements. This is what lets you reconstruct exactly what the system knew when it made a decision.

How do we estimate effort to fix these issues across our AI stack?

It depends heavily on how your catalog, inventory, and inference layers are currently coupled, and which of the five gaps are present. For a tailored assessment of your specific stack, talk to CodeNicely — we've worked through these failure modes on live healthcare marketplaces and can help you scope the right sequence of fixes.

Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.