
Fine-Tune a Prescription NER Model on 500 Labeled Lines

For: A solo ML engineer at a seed-stage e-pharmacy startup who needs to extract drug names, dosages, and refill instructions from messy OCR'd prescription scans — and has been told off-the-shelf medical NLP APIs are too expensive and too US-centric for their patient population

You have a folder of OCR'd prescription scans, a tight runway, and a medical NER API that bills per call and tags Crocin as O. The good news: you do not need a labeled corpus the size of MIMIC to beat a general-purpose model on your own format. You need around 500 lines that look like the messy reality your OCR pipeline produces — regional brand vocabulary, BD/TDS abbreviations, missing spaces, the works.

This is a tutorial for a solo ML engineer. We will fine-tune a spaCy NER model on prescription text, label four entity types (DRUG, DOSAGE, FREQUENCY, REFILL), and walk through the exact commands. By the end you will have a model that runs locally, costs nothing per inference, and outperforms generic biomedical NER on your data.

Why 500 lines is enough

The dominant error source when you take a model trained on US clinical notes and run it on Indian or Middle Eastern e-pharmacy prescriptions is not sample size. It is corpus mismatch. The model has never seen Dolo 650, has never seen 1-0-1 as a frequency, has never seen R/L for refill. No amount of MIMIC pretraining helps with vocabulary it has never encountered.

A few hundred lines that reflect your actual OCR noise and regional drug list close that gap fast. The tradeoff: this model will not generalize to free-text doctor notes, hospital discharge summaries, or anything outside the prescription format you trained on. That is fine. You are not building a biomedical research tool. You are extracting four fields from a tightly-scoped document type.

Prerequisites

- Python 3 with spaCy 3.x and pandas installed (pip install spacy pandas)
- Your OCR output as one row per prescription line, e.g. ocr_lines.csv with line_id, raw_text, and source_clinic columns
- An annotation tool: Prodigy if you have a license, Label Studio if you don't

Step 1: Sample your data the right way

Do not random-sample from your OCR output. Stratify. You want your 500 lines to cover the variation that actually exists, not just the most common pattern.

import pandas as pd

df = pd.read_csv("ocr_lines.csv")  # columns: line_id, raw_text, source_clinic

# crude stratification: by clinic, by line length bucket
df["len_bucket"] = pd.cut(df["raw_text"].str.len(), bins=[0, 30, 60, 120, 999])
sample = df.groupby(["source_clinic", "len_bucket"], group_keys=False, observed=True)\
           .apply(lambda g: g.sample(min(len(g), 25), random_state=42))

# trim the stratified pool to the 500 lines you will actually label
sample = sample.sample(min(len(sample), 500), random_state=42)
sample.to_json("to_label.jsonl", orient="records", lines=True)
print(f"sampled {len(sample)} lines from {df['source_clinic'].nunique()} clinics")

Expected output:

sampled 500 lines from 14 clinics

Hold out 50 of these as a test set before you label anything. Label them with the same care, then do not look at them again until the end.
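
A minimal way to carve that split out before annotation. The 75-line dev split here is a judgment call, not a rule; after labeling, the annotated versions of these files become the train.jsonl, dev.jsonl, and test.jsonl used below:

import json
import random

with open("to_label.jsonl") as f:
    rows = [json.loads(line) for line in f]

random.seed(42)
random.shuffle(rows)

# 50 held-out test lines, 75 dev lines, the rest for training
splits = {"test": rows[:50], "dev": rows[50:125], "train": rows[125:]}
for name, part in splits.items():
    with open(f"to_label_{name}.jsonl", "w") as f:
        for r in part:
            f.write(json.dumps(r) + "\n")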

Step 2: Define your label scheme

Keep it tight. Four labels is enough for a prescription:

- DRUG: the brand or generic name (Dolo, Crocin, Pan-D)
- DOSAGE: the strength, including the unit (650mg, 40mg)
- FREQUENCY: the dosing schedule (1-0-1, BD, TDS)
- REFILL: the duration or refill instruction (x 5 days)

Resist adding ROUTE, INDICATION, or DOCTOR_NOTE on the first pass. Each new label roughly doubles your annotation effort and your inter-annotator disagreement. Ship four labels well, add the fifth in v2.

Step 3: Annotate

Use Prodigy if you have a license, Label Studio if you don't. The output you want is spaCy's training format: text plus a list of (start, end, label) spans.

A labeled example looks like this:

{
  "text": "Tab. Dolo 650mg 1-0-1 x 5 days",
  "entities": [
    [5, 9, "DRUG"],
    [10, 15, "DOSAGE"],
    [16, 21, "FREQUENCY"],
    [22, 30, "REFILL"]
  ]
}

Two practical tips that will save you days:

  1. Write a labeling guide before you start. Decide whether Tab. is part of DRUG (it isn't, in our scheme). Decide whether 650mg includes the unit (yes). Write it down. Re-read it every 100 examples.
  2. Annotate in two passes. First pass fast and rough. Second pass, review your own labels. You will fix 10–15% of them. That fix is worth more than 100 new examples.
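
A cheap way to run that second pass: print every span exactly as the offsets slice it, so off-by-one errors jump out. A minimal sketch, assuming your tool exported the JSONL format shown above:

import json

with open("train.jsonl") as f:
    for line in f:
        r = json.loads(line)
        print(r["text"])
        for start, end, label in r["entities"]:
            # print the exact slice the offsets select; boundary mistakes are obvious here
            print(f"  {label:10s} {r['text'][start:end]!r}")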

Step 4: Convert to spaCy binary format

spaCy 3 trains from .spacy files, not JSON. Convert:

import spacy
from spacy.tokens import DocBin
import json

nlp = spacy.blank("en")

def to_docbin(jsonl_path, out_path):
    db = DocBin()
    skipped = 0
    with open(jsonl_path) as f:
        for line in f:
            r = json.loads(line)
            doc = nlp.make_doc(r["text"])
            ents = []
            for start, end, label in r["entities"]:
                span = doc.char_span(start, end, label=label, alignment_mode="contract")
                if span is None:
                    skipped += 1
                    continue
                ents.append(span)
            doc.ents = ents
            db.add(doc)
    db.to_disk(out_path)
    print(f"wrote {out_path}, skipped {skipped} misaligned spans")

to_docbin("train.jsonl", "train.spacy")
to_docbin("dev.jsonl", "dev.spacy")

Expected output:

wrote train.spacy, skipped 3 misaligned spans
wrote dev.spacy, skipped 0 misaligned spans

If skipped is more than 1–2% of your spans, your offsets are off. Fix them now, not later.

Step 5: Generate a config and train

spaCy ships a config generator. For 500 examples, start with the CPU-friendly pipeline:

python -m spacy init config config.cfg \
  --lang en --pipeline ner --optimize accuracy

Open config.cfg and check two things: that the vectors are set to en_core_web_lg (the --optimize accuracy preset uses them; download the package first with python -m spacy download en_core_web_lg), and that the [training] stopping settings suit a small dataset (the generated defaults, eval_frequency = 200 and patience = 1600, are fine for 500 lines).

Train:

python -m spacy train config.cfg \
  --output ./model \
  --paths.train ./train.spacy \
  --paths.dev ./dev.spacy

Expected output (truncated):

E    #       LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  --------  ------  ------  ------  -----
  0       0    142.31    0.00    0.00    0.00   0.00
  2     200    980.44   71.20   74.10   68.50   0.71
  6     600    412.18   84.60   86.20   83.10   0.85
 12    1200    188.05   89.10   90.40   87.80   0.89
 ...
 38    4400     22.41   91.80   92.50   91.10   0.92

If your dev F1 plateaus around 0.85–0.92 on a 4-label prescription task with 500 lines, you are in the expected range. If you are below 0.75, the issue is almost always label noise, not model capacity.

Step 6: Evaluate on the held-out set

Convert your held-out test set with the same helper from Step 4 (to_docbin("test.jsonl", "test.spacy")), then evaluate:

python -m spacy evaluate ./model/model-best ./test.spacy \
  --output metrics.json

Look at per-label F1, not just overall. A common pattern: DRUG and DOSAGE score above 0.93, FREQUENCY around 0.88, REFILL around 0.82. REFILL is hardest because the surface forms vary the most. If REFILL is dragging you down, label 50 more REFILL-heavy examples. Targeted labeling beats random labeling at this scale.
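
To pull the per-label numbers out of metrics.json (spaCy's scorer stores them under ents_per_type):

import json

with open("metrics.json") as f:
    m = json.load(f)

# per-label precision / recall / F1
for label, s in m["ents_per_type"].items():
    print(f"{label:10s} P={s['p']:.3f} R={s['r']:.3f} F={s['f']:.3f}")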

Step 7: Run inference and post-process

import spacy
nlp = spacy.load("./model/model-best")

text = "Cap. Pan-D 40mg 1-0-0 before food x 10 days"
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:20s} {ent.label_}")

Expected output:

Pan-D                DRUG
40mg                 DOSAGE
1-0-0                FREQUENCY
x 10 days            REFILL

Do not stop at the model output. Add a post-processing layer:

- Check every predicted DRUG against your formulary; anything unknown goes to a pharmacist review queue.
- Parse DOSAGE and FREQUENCY into structured values and reject lines that do not parse.
- Flag lines where an expected entity (DRUG, DOSAGE, FREQUENCY) is missing entirely.
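
A minimal sketch of that layer. The formulary set and the dosage regex are illustrative stand-ins, not a production drug list:

import re
import spacy

nlp = spacy.load("./model/model-best")
doc = nlp("Cap. Pan-D 40mg 1-0-0 before food x 10 days")

# illustrative stand-in -- load your real formulary here
KNOWN_DRUGS = {"dolo", "crocin", "pan-d", "paracetamol"}
DOSAGE_RE = re.compile(r"^\d+(\.\d+)?(mg|mcg|g|ml)$", re.IGNORECASE)

def validate(ents):
    """Take (text, label) pairs from doc.ents; return fields plus review flags."""
    fields, flags = {}, []
    for text, label in ents:
        fields[label] = text
        if label == "DRUG" and text.lower() not in KNOWN_DRUGS:
            flags.append(f"unknown drug: {text}")
        if label == "DOSAGE" and not DOSAGE_RE.match(text.replace(" ", "")):
            flags.append(f"unparseable dosage: {text}")
    for required in ("DRUG", "DOSAGE", "FREQUENCY"):
        if required not in fields:
            flags.append(f"missing {required}")
    return fields, flags

fields, flags = validate([(e.text, e.label_) for e in doc.ents])
# anything with flags goes to the pharmacist queue, not straight to the patient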

This last step matters more than another point of F1. A pharmacist reviewing a 5% unknown-drug queue is a safety feature. A model that silently mispredicts is a liability. Teams building patient-facing pharmacy tooling — including work we've done on AI-assisted drug interaction flows — gate the model output behind exactly this kind of validation layer.

Step 8: Active learning loop

You will deploy this model and immediately see it fail on inputs you did not anticipate. Set up a loop:

  1. Log every prediction with a confidence signal. spaCy's default greedy NER does not expose span probabilities; use the beam_ner component or a spancat component if you need real scores.
  2. Surface low-confidence predictions and predictions on tokens not in your training vocabulary (see the sketch after this list).
  3. Label 50 of those a week. Retrain monthly.
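
A sketch of the out-of-vocab check from item 2. The example input is hypothetical; the idea is to flag lines containing tokens the model never saw during training:

import json
import spacy

nlp = spacy.load("./model/model-best")

# collect every token seen in the training data
train_vocab = set()
with open("train.jsonl") as f:
    for line in f:
        doc = nlp.make_doc(json.loads(line)["text"])
        train_vocab.update(t.text.lower() for t in doc)

for doc in nlp.pipe(["Tab. Zerodol-SP 1-0-1 x 3 days"]):  # hypothetical unseen drug
    unseen = [t.text for t in doc if t.text.lower() not in train_vocab]
    status = "REVIEW" if unseen else "OK"
    print(status, [(e.text, e.label_) for e in doc.ents], unseen)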

This is how you get from 0.90 F1 to 0.96 F1 without ever sitting through a 5000-example annotation marathon.

Common errors

"E024: Could not find a gold-standard alignment"

Your character offsets do not align with spaCy's tokenization. Use alignment_mode="contract" in doc.char_span, or switch to "expand" if you'd rather over-include. Fix the underlying offsets if more than a handful of spans are dropped.

Dev F1 stuck below 0.70

Almost always label inconsistency. Pull 30 random training examples, re-label them blind, and compare to the originals. If you disagree with yourself on more than 15%, your label scheme is ambiguous. Tighten the guide and re-label.
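
A quick way to set up that blind pass: sample 30 training lines and strip the labels before handing them back to yourself:

import json
import random

with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]

random.seed(7)
with open("audit_blind.jsonl", "w") as f:
    for r in random.sample(rows, 30):
        f.write(json.dumps({"text": r["text"]}) + "\n")  # labels stripped for blind re-annotation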

Model predicts DRUG correctly but misses common ones like "Crocin"

Vocabulary issue. en_core_web_lg has no vector for Crocin. Either include 5–10 examples of each high-frequency drug in training, or initialize from a transformer base (en_core_web_trf) which handles unseen tokens better at the cost of inference speed.

OCR noise like "Dolo650mg" (no space) is missed

Tokenizer treats it as one token. Either add a custom tokenizer rule that splits letters from digits, or include enough no-space examples in training that the model at least tags the fused token (a single token can only carry one entity label, so you lose the separate DOSAGE span). The first option is cleaner.
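
A sketch of that tokenizer rule. Note that you need the same rule in place at training time, or training and inference tokenization will disagree:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("./model/model-best")

# split at letter/digit boundaries so "Dolo650mg" becomes "Dolo", "650", "mg"
infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])(?=\d)", r"(?<=\d)(?=[A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("Tab. Dolo650mg 1-0-1")])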

Inference is too slow at scale

Use nlp.pipe(texts, batch_size=64, n_process=4) instead of looping nlp(text). Expect 5–10x speedup on CPU. If still too slow, distill to a smaller pipeline or move to ONNX runtime.

What this approach is bad at

Be honest with yourself about the limits:

- It will not generalize to free-text doctor notes, discharge summaries, or anything outside the prescription format you trained on.
- It extracts fields; it does not reason about them. Interaction checks, dose sanity limits, and contraindications belong in the validation layer, not the model.
- It is coupled to your OCR pipeline. Change the OCR engine or the prescription layout and you are re-sampling and re-labeling.

For most seed-stage e-pharmacy startups, this is the right tradeoff. You ship a fast, cheap, controllable extractor and stack the safety logic on top. Teams that get this stack right tend to be the ones building deliberate, narrow ML components rather than calling a single magic API. If you are scoping something larger across pharmacy, lending, or logistics workflows, the way we approach applied AI builds follows the same pattern: small models, tight scopes, explicit validation layers.

Frequently Asked Questions

How many labeled examples do I really need to fine-tune a prescription NER model?

For a 4-label scheme on a tightly-scoped document type, 400–600 carefully labeled lines typically gets you to 0.88–0.92 F1. The variance comes from label consistency and how well your sample reflects production data, not raw count. Adding 1000 sloppy examples will hurt you; adding 100 carefully-reviewed examples on your weakest label will help.

Should I use spaCy or a transformer model like BioBERT?

Start with spaCy's en_core_web_lg pipeline. It trains in minutes on CPU and is easy to debug. Move to a transformer base (spaCy's en_core_web_trf or a fine-tuned BioBERT) only if you hit a ceiling and have evidence the bottleneck is model capacity rather than label quality. For most prescription extraction tasks, the transformer is overkill.

Can I use ChatGPT or Claude to label my training data?

Yes, for a first pass — but always have a human review every label before training. LLMs are reasonable at extracting drug names but make consistent errors on regional brands and abbreviated frequencies, and those errors will be baked into your model. Use the LLM to draft, use a human to verify.

How do I handle multilingual prescriptions (English drug names, local-language instructions)?

Start from a multilingual base (xx_ent_wiki_sm or a multilingual transformer) and include code-mixed examples in your training set in proportion to their real frequency. Do not translate first: translation introduces errors that compound with NER errors. Label the mixed text directly.

Is fine-tuning a medical NER model HIPAA or DPDP compliant?

The model itself is just weights — compliance lives in your data handling pipeline. De-identify training data before annotation, control access to the labeled corpus, and audit who runs inference where. For a production deployment in regulated markets, talk to CodeNicely for a personalized assessment of your compliance posture.

Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.