May 11, 2026 • 11 min read

Fine-Tune an Embedding Model on Your Own Docs in 6 Steps

For: A mid-stage B2B SaaS engineer who shipped a RAG feature on top of OpenAI or Cohere embeddings, tuned chunking and re-ranking to exhaustion, and still watches users complain that search misses obvious results — because the embedding model was never trained on their product's vocabulary

You've shipped a RAG feature. You tuned chunk sizes, added a Cohere re-ranker, played with hybrid BM25, and the support channel still pings you with the same complaint: a user searched for MRR churn cohort and got back a doc about monthly billing reconciliation. The retrieval looks confident. It's just wrong.

The defect usually isn't in your pipeline. It's in the representation. A generic embedding model — text-embedding-3-small, embed-english-v3.0, whatever you're using — was trained on web text. It has never seen your product's vocabulary. In a SaaS knowledge base, ARR and annual recurring revenue can sit surprisingly far apart in vector space. Churn and cancellation are not always near-neighbors. Your customers know they're synonyms. Your model doesn't.
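You don't have to take that on faith. A quick check with any generic encoder makes the gap visible; the sketch below uses the same open model we fine-tune later (installed in the next section), and your exact scores will differ:

from sentence_transformers import SentenceTransformer, util

# Any off-the-shelf encoder works for this spot check; bge-base is the model used later in the tutorial
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

pairs = [
    ("ARR", "annual recurring revenue"),                      # should be close
    ("churn", "cancellation"),                                # should be close
    ("MRR churn cohort", "monthly billing reconciliation"),   # should be far apart
]
for a, b in pairs:
    emb = model.encode([a, b], normalize_embeddings=True)
    print(f"{a!r} vs {b!r}: cosine = {util.cos_sim(emb[0], emb[1]).item():.3f}")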

This tutorial walks through fine-tuning a small open-source embedding model on your own docs, using contrastive pairs mined from query logs. You don't need a research budget. 500–2,000 good pairs and a single GPU (or even a beefy CPU for a small model) will outperform a larger off-the-shelf model on your domain — and cut your inference bill because you're now serving a 33M or 110M parameter model instead of calling an API.

What you'll need before starting

A Python environment, your knowledge-base docs, and a few months of query logs or support-ticket history to mine pairs from. A single GPU is nice but not required for the smaller models. Install the training stack:

pip install -U sentence-transformers datasets accelerate

Step 1: Pick a base model honestly

Don't start with the biggest model on the MTEB leaderboard. Start with the smallest one that lands within about 5 points of your current OpenAI baseline on your eval set, even if it trails slightly. You're going to fine-tune it; the gap will close.

Reasonable defaults: BAAI/bge-small-en-v1.5 (roughly 33M parameters) if serving cost and latency dominate, BAAI/bge-base-en-v1.5 (roughly 110M) otherwise. Load it and confirm what you're getting:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
print(model)

Expected output: a model summary showing the transformer backbone (BertModel) and a pooling layer with output dim 768.

Step 2: Mine training pairs from real signals

This is the step that determines whether the fine-tune works. The model needs to learn what's similar in your domain and what's not. You have four cheap sources for pairs:

  1. Query logs + click-through. If a user searched how to invite teammate and clicked the result titled Adding users to your workspace, that's a positive pair.
  2. Support tickets + resolved doc links. Support agent linked KB article #482 to close the ticket → (ticket_subject, article_482) is a positive pair.
  3. Title ↔ body pairs from your own docs. Free, noisy, but useful as augmentation.
  4. LLM-generated queries. For each doc, prompt GPT-4 or Claude to generate 3 realistic user questions it answers. Filter aggressively.
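For source #4, here's a minimal sketch of the generation step. It assumes the OpenAI Python client and a gpt-4o model name (swap in whichever LLM you use), and the length filter at the end is a placeholder; be stricter in practice:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthetic_queries(doc_text: str, n: int = 3) -> list[str]:
    """Ask an LLM for n realistic user questions this doc answers."""
    prompt = (
        f"You are a user of a B2B SaaS product. Write {n} short, realistic search queries "
        f"that this documentation page answers. One per line, no numbering.\n\n{doc_text[:4000]}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    queries = [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
    # Filter aggressively: drop long queries and ones that just copy the doc's own phrasing
    return [q for q in queries if len(q.split()) <= 12 and q.lower() not in doc_text.lower()]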

Save them as a JSONL of {"query": ..., "positive": ...}. Aim for 500 minimum. 2,000 is comfortable.

import json

with open("pairs.jsonl", "w") as f:
    for q, p in mined_pairs:  # your list of (query, positive_passage)
        f.write(json.dumps({"query": q, "positive": p}) + "\n")

Don't bother manually picking negatives. We'll use in-batch negatives via MultipleNegativesRankingLoss, which treats every other positive in the batch as a negative. It's the workhorse loss for this task.

Step 3: Build a real evaluation set before you train

If you can't measure it, don't ship it. Create a held-out set of 50–200 queries with their correct doc IDs. Compute Recall@5 and MRR@10 with your current embedding model first. Write that number down. Tape it to your monitor.

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# queries: {qid: query_text}
# corpus:  {did: doc_text}
# relevant_docs: {qid: set([did, ...])}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="saas-kb-eval",
    mrr_at_k=[10],
    accuracy_at_k=[1, 5],
    precision_recall_at_k=[5, 10],
)

baseline = evaluator(model)
print(baseline)

Expected output: a dict with values like {'saas-kb-eval_cosine_recall@5': 0.62, 'saas-kb-eval_cosine_mrr@10': 0.48, ...}. Your numbers will differ. What matters is having a number.
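If you're unsure how to assemble those three dicts, here's one way to build them from a held-out split of the pairs you mined in Step 2, assuming each doc in your KB has a stable ID and the filename eval_pairs.jsonl is a placeholder:

import json

queries, corpus, relevant_docs = {}, {}, {}

# eval_pairs.jsonl: held-out rows of {"query": ..., "positive": ..., "doc_id": ...}
# (doc_id is whatever stable identifier your KB already has)
with open("eval_pairs.jsonl") as f:
    for i, line in enumerate(f):
        row = json.loads(line)
        qid, did = f"q{i}", str(row["doc_id"])
        queries[qid] = row["query"]
        corpus[did] = row["positive"]
        relevant_docs[qid] = {did}

# In practice, load your FULL knowledge base into `corpus`, not just the positives,
# or Recall@5 will look flattering.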

Step 4: Fine-tune with MultipleNegativesRankingLoss

The sentence-transformers v3 API uses the HuggingFace Trainer. Here's a minimal but real training script:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

dataset = load_dataset("json", data_files="pairs.jsonl", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="./bge-base-saas",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    logging_steps=20,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_saas-kb-eval_cosine_mrr@10",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss=loss,
    evaluator=evaluator,
)

trainer.train()
model.save_pretrained("./bge-base-saas-final")

A few honest notes on the hyperparameters:

  1. Batch size matters most. MultipleNegativesRankingLoss treats every other positive in the batch as a negative, so a larger batch means more and harder negatives. Go as large as your VRAM allows before touching anything else.
  2. 1–3 epochs is enough for a few thousand pairs; more than that invites the overfitting described under "Common errors" below.
  3. learning_rate=2e-5 with 10% warmup is a conventional starting point for encoder fine-tuning; lower it before you raise it.
  4. fp16=True assumes a GPU; drop it if you're training on CPU.

Step 5: Re-evaluate and sanity-check on real queries

Run the evaluator on your fine-tuned model:

tuned = SentenceTransformer("./bge-base-saas-final")
results = evaluator(tuned)
print(results)

A working fine-tune on a domain-mismatched base model typically lifts Recall@5 by 8–20 points and MRR@10 by a similar amount. If you see < 2 points improvement, something's wrong — usually your training pairs are too close to noise (LLM-generated without filtering) or your eval set overlaps with training data.
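A quick way to check the second failure mode (eval queries leaking into training data), assuming both sets are JSONL files in the format from Steps 2 and 3; eval_pairs.jsonl is a hypothetical filename:

import json

def load_queries(path):
    with open(path) as f:
        return {json.loads(line)["query"].strip().lower() for line in f}

train_q = load_queries("pairs.jsonl")
eval_q = load_queries("eval_pairs.jsonl")  # held-out file from Step 3

overlap = train_q & eval_q
print(f"{len(overlap)} of {len(eval_q)} eval queries also appear in training data")
# Anything above roughly zero exact matches means your eval numbers are inflated.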

Then do the eyeball test. Take the 10 queries your team has personally complained about and run them through both models:

queries = ["MRR churn cohort", "invite teammate sso", "webhook retry policy"]
for q in queries:
    q_emb = tuned.encode(q, normalize_embeddings=True)
    scores = (corpus_embeddings @ q_emb).tolist()
    top = sorted(zip(scores, corpus_titles), reverse=True)[:3]
    print(q, "→", top)

You should see results that actually look like answers a human would give. If MRR churn cohort still returns billing reconciliation, your pairs didn't cover that vocabulary. Add 20 more pairs that explicitly bridge those terms and retrain.

Step 6: Ship it without breaking your index

You changed embedding dimensions? No, you didn't — you fine-tuned the same architecture, so dim stays at 768. But the vectors are different. You must re-encode your entire corpus before swapping the model in production.

# Re-encode the corpus with the new model
corpus_embeddings = tuned.encode(
    list(corpus.values()),
    batch_size=64,
    normalize_embeddings=True,
    show_progress_bar=True,
)

# Write to your vector DB (pgvector, Qdrant, Pinecone, etc.)
# Use a new collection / namespace so you can A/B against the old one.
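To make that last comment concrete, here's a minimal sketch against Qdrant; the same pattern applies to pgvector or Pinecone, and the collection name, URL, and payload fields are assumptions:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

# New collection so the old index keeps serving while you A/B
client.create_collection(
    collection_name="kb_bge_saas_v2",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

client.upsert(
    collection_name="kb_bge_saas_v2",
    points=[
        PointStruct(id=i, vector=emb.tolist(), payload={"doc_id": did})
        for i, (did, emb) in enumerate(zip(corpus.keys(), corpus_embeddings))
    ],
)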

Run both indices in parallel for a week. Route 10% of traffic to the new one. Watch click-through rate on the top result and the zero-result rate. If both move in the right direction, ramp.
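The 10% split can be as simple as deterministically hashing a stable org or user ID, so the same customer always hits the same index; a minimal sketch:

import hashlib

def use_new_index(org_id: str, rollout_pct: int = 10) -> bool:
    """Deterministically route the same org to the same index on every request."""
    bucket = int(hashlib.sha256(org_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct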

Common errors and what they actually mean

Loss goes to zero in the first 50 steps

Your pairs are too easy — usually because positives are near-identical to queries (e.g., you used the doc title as both query and positive). Add harder pairs, or mine queries from real logs instead.

Loss is unstable, spikes randomly

Batch size too small, or learning rate too high. Increase batch size first: with MultipleNegativesRankingLoss, a bigger batch also means more in-batch negatives. If you're VRAM-limited, gradient_accumulation_steps=4 will smooth the gradients but won't add negatives; to get the large-batch negative pool on a small GPU, use the cached variant of the loss shown below.
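sentence-transformers ships a cached (GradCache-style) variant of the same loss for exactly this situation; a minimal sketch:

from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Behaves like MultipleNegativesRankingLoss with the full batch of negatives,
# but only mini_batch_size examples sit on the GPU at once.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)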

Eval metrics improve on test split but regress on production queries

Classic distribution drift. Your training pairs came from one source (say, support tickets) and your real queries look different (in-product search). Re-mine pairs from the source that matches production.

Out-of-memory error during evaluation

The InformationRetrievalEvaluator encodes the full corpus into memory. For corpora over ~50K docs, pass corpus_chunk_size=10000 to the evaluator.
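In code, that's one extra argument to the evaluator from Step 3:

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="saas-kb-eval",
    corpus_chunk_size=10_000,  # encode and score the corpus in 10K-doc chunks
)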

Model performs worse than the base model on general queries

You overfit. Mix in some general-domain pairs (e.g., 10–20% from MS MARCO) during training, or reduce epochs to 1–2. This is the real tradeoff of fine-tuning: you gain domain accuracy and lose some out-of-distribution generalization. For a closed-domain product search, that's usually a fair trade. For an open-domain assistant, it's not.
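A minimal sketch of the mixing, assuming you've exported some general-domain (query, positive) pairs, for example sampled from MS MARCO, to a general_pairs.jsonl in the same format as Step 2:

from datasets import load_dataset, concatenate_datasets

domain = load_dataset("json", data_files="pairs.jsonl", split="train")
general = load_dataset("json", data_files="general_pairs.jsonl", split="train")

# Roughly 10-20% general-domain pairs is enough to anchor the model
n_general = min(len(general), int(0.15 * len(domain)))
mixed = concatenate_datasets([domain, general.shuffle(seed=42).select(range(n_general))])
mixed = mixed.shuffle(seed=42)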

What this approach is bad at

Honest list, because you'll find these the hard way otherwise:

  1. Open-domain and general-knowledge queries. Fine-tuning trades out-of-distribution generalization for domain sharpness; if your assistant answers questions beyond your product, keep a general model in the loop or mix in general pairs as described above.
  2. Teams without real query signal. If every pair is LLM-generated and unfiltered, the gains shrink fast; quality dominates quantity (see the FAQ below).
  3. Ranking problems inside the top 20. If the right doc is retrieved but buried at position 12, that's a re-ranker's job, not the embedder's (also covered in the FAQ).

How CodeNicely can help

We built the RAG and search layer for GimBooks, a YC-backed accounting SaaS where the vocabulary problem was severe: small-business users in India describe invoicing in a mix of English, Hindi-English code-switching, and informal accounting terms (kacha bill, udhaar, party ledger) that no off-the-shelf embedding model handles cleanly. Generic embeddings missed roughly half the obvious queries. The fix was exactly the pipeline above: mining pairs from in-app search logs, fine-tuning a small open-source encoder, and shipping it behind an A/B test against the original index.

If you're sitting on query logs and complaints but don't have the team bandwidth to run the data-mining and evaluation loop, that's a problem we've solved on production traffic. Our AI Studio works with B2B SaaS teams on retrieval, ranking, and embedding-layer fixes — typically as an embedded pod rather than a one-off consultation. See how we work with scale-ups for the engagement model.

Frequently Asked Questions

How many training pairs do I actually need to fine-tune an embedding model?

500 high-quality pairs from real user queries will outperform 5,000 LLM-generated synthetic pairs. Quality dominates quantity. If you can get to 1,500–2,000 mined from logs and support tickets, you're in a comfortable zone for most B2B SaaS knowledge bases.

Should I fine-tune the embedding model or just add a re-ranker?

Re-rankers (Cohere Rerank, BGE reranker) help when retrieval surfaces the right doc in the top 20 but not the top 3. They don't help when retrieval misses the doc entirely — which is the vocabulary-mismatch problem. If your Recall@20 is already high, add a re-ranker. If it's not, fine-tune the embedder first.

Will a fine-tuned open-source model really beat OpenAI's text-embedding-3-large on my domain?

On your domain, frequently yes. On general benchmarks, no. text-embedding-3-large is broader; a fine-tuned bge-base is sharper on your specific vocabulary. The honest answer depends on how distinctive your domain language is. Build the eval set in Step 3 and measure — don't guess.

How often do I need to re-train the embedding model?

Trigger re-training when (a) you ship significant new product surface area with new terminology, or (b) your eval-set metrics drift more than 5 points from launch baseline. For most stable B2B SaaS products, once or twice a year is enough. For rapidly evolving products, quarterly.

What does it cost to do this end-to-end with CodeNicely?

It depends on the size of your corpus, the quality of your existing query logs, and whether you need ongoing re-training infrastructure or a one-time lift. Contact CodeNicely for a personalized assessment and we'll scope it against your actual data.

Building something in SaaS?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team