May 11, 2026 • 11 min read

Fine-Tune an Embedding Model on Your Own Docs in 6 Steps

For: A mid-stage B2B SaaS engineer who shipped a RAG feature on top of OpenAI or Cohere embeddings, tuned chunking and re-ranking to exhaustion, and still watches users complain that search misses obvious results — because the embedding model was never trained on their product's vocabulary

You've shipped a RAG feature. You tuned chunk sizes, added a Cohere re-ranker, played with hybrid BM25, and the support channel still pings you with the same complaint: a user searched for MRR churn cohort and got back a doc about monthly billing reconciliation. The retrieval looks confident. It's just wrong.

The defect usually isn't in your pipeline. It's in the representation. A generic embedding model — text-embedding-3-small, embed-english-v3.0, whatever you're using — was trained on web text. It has never seen your product's vocabulary. In a SaaS knowledge base, ARR and annual recurring revenue can sit surprisingly far apart in vector space. Churn and cancellation are not always near-neighbors. Your customers know they're synonyms. Your model doesn't.
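You don't have to take that on faith. A quick check with any generic encoder makes the gap visible; the sketch below uses the same open model we fine-tune later (installed in the next section), and your exact scores will differ:

from sentence_transformers import SentenceTransformer, util

# Any off-the-shelf encoder works for this spot check; bge-base is the model used later in the tutorial
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

pairs = [
    ("ARR", "annual recurring revenue"),                      # should be close
    ("churn", "cancellation"),                                # should be close
    ("MRR churn cohort", "monthly billing reconciliation"),   # should be far apart
]
for a, b in pairs:
    emb = model.encode([a, b], normalize_embeddings=True)
    print(f"{a!r} vs {b!r}: cosine = {util.cos_sim(emb[0], emb[1]).item():.3f}")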

This tutorial walks through fine-tuning a small open-source embedding model on your own docs, using contrastive pairs mined from query logs. You don't need a research budget. 500–2,000 good pairs and a single GPU (or even a beefy CPU for a small model) will outperform a larger off-the-shelf model on your domain — and cut your inference bill because you're now serving a 33M or 110M parameter model instead of calling an API.

What you'll need before starting

A Python environment, your knowledge-base docs, and a few months of query logs or support-ticket history to mine pairs from. A single GPU is nice but not required for the smaller models. Install the training stack:

pip install -U sentence-transformers datasets accelerate

Step 1: Pick a base model honestly

Don't start with the biggest model on the MTEB leaderboard. Start with the smallest one that lands within about 5 points of your current OpenAI baseline on your eval set, even if it trails slightly. You're going to fine-tune it; the gap will close.

Reasonable defaults: BAAI/bge-small-en-v1.5 (roughly 33M parameters) if serving cost and latency dominate, BAAI/bge-base-en-v1.5 (roughly 110M) otherwise. Load it and confirm what you're getting:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
print(model)

Expected output: a model summary showing the transformer backbone (BertModel) and a pooling layer with output dim 768.

Step 2: Mine training pairs from real signals

This is the step that determines whether the fine-tune works. The model needs to learn what's similar in your domain and what's not. You have four cheap sources for pairs:

  1. Query logs + click-through. If a user searched how to invite teammate and clicked the result titled Adding users to your workspace, that's a positive pair.
  2. Support tickets + resolved doc links. Support agent linked KB article #482 to close the ticket → (ticket_subject, article_482) is a positive pair.
  3. Title ↔ body pairs from your own docs. Free, noisy, but useful as augmentation.
  4. LLM-generated queries. For each doc, prompt GPT-4 or Claude to generate 3 realistic user questions it answers. Filter aggressively.
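For source #4, here's a minimal sketch of the generation step. It assumes the OpenAI Python client and a gpt-4o model name (swap in whichever LLM you use), and the length filter at the end is a placeholder; be stricter in practice:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthetic_queries(doc_text: str, n: int = 3) -> list[str]:
    """Ask an LLM for n realistic user questions this doc answers."""
    prompt = (
        f"You are a user of a B2B SaaS product. Write {n} short, realistic search queries "
        f"that this documentation page answers. One per line, no numbering.\n\n{doc_text[:4000]}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    queries = [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
    # Filter aggressively: drop long queries and ones that just copy the doc's own phrasing
    return [q for q in queries if len(q.split()) <= 12 and q.lower() not in doc_text.lower()]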

Save them as a JSONL of {"query": ..., "positive": ...}. Aim for 500 minimum. 2,000 is comfortable.

import json

with open("pairs.jsonl", "w") as f:
    for q, p in mined_pairs:  # your list of (query, positive_passage)
        f.write(json.dumps({"query": q, "positive": p}) + "\n")

Don't bother manually picking negatives. We'll use in-batch negatives via MultipleNegativesRankingLoss, which treats every other positive in the batch as a negative. It's the workhorse loss for this task.

Step 3: Build a real evaluation set before you train

If you can't measure it, don't ship it. Create a held-out set of 50–200 queries with their correct doc IDs. Compute Recall@5 and MRR@10 with your current embedding model first. Write that number down. Tape it to your monitor.

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# queries: {qid: query_text}
# corpus:  {did: doc_text}
# relevant_docs: {qid: set([did, ...])}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="saas-kb-eval",
    mrr_at_k=[10],
    accuracy_at_k=[1, 5],
    precision_recall_at_k=[5, 10],
)

baseline = evaluator(model)
print(baseline)

Expected output: a dict with values like {'saas-kb-eval_cosine_recall@5': 0.62, 'saas-kb-eval_cosine_mrr@10': 0.48, ...}. Your numbers will differ. What matters is having a number.
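If you're unsure how to assemble those three dicts, here's one way to build them from a held-out split of the pairs you mined in Step 2, assuming each doc in your KB has a stable ID and the filename eval_pairs.jsonl is a placeholder:

import json

queries, corpus, relevant_docs = {}, {}, {}

# eval_pairs.jsonl: held-out rows of {"query": ..., "positive": ..., "doc_id": ...}
# (doc_id is whatever stable identifier your KB already has)
with open("eval_pairs.jsonl") as f:
    for i, line in enumerate(f):
        row = json.loads(line)
        qid, did = f"q{i}", str(row["doc_id"])
        queries[qid] = row["query"]
        corpus[did] = row["positive"]
        relevant_docs[qid] = {did}

# In practice, load your FULL knowledge base into `corpus`, not just the positives,
# or Recall@5 will look flattering.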

Step 4: Fine-tune with MultipleNegativesRankingLoss

The sentence-transformers v3 API uses the HuggingFace Trainer. Here's a minimal but real training script:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

dataset = load_dataset("json", data_files="pairs.jsonl", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="./bge-base-saas",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    logging_steps=20,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_saas-kb-eval_cosine_mrr@10",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss=loss,
    evaluator=evaluator,
)

trainer.train()
model.save_pretrained("./bge-base-saas-final")

A few honest notes on the hyperparameters:

  1. Batch size matters most. MultipleNegativesRankingLoss treats every other positive in the batch as a negative, so a larger batch means more and harder negatives. Go as large as your VRAM allows before touching anything else.
  2. 1–3 epochs is enough for a few thousand pairs; more than that invites the overfitting described under "Common errors" below.
  3. learning_rate=2e-5 with 10% warmup is a conventional starting point for encoder fine-tuning; lower it before you raise it.
  4. fp16=True assumes a GPU; drop it if you're training on CPU.

Step 5: Re-evaluate and sanity-check on real queries

Run the evaluator on your fine-tuned model:

tuned = SentenceTransformer("./bge-base-saas-final")
results = evaluator(tuned)
print(results)

A working fine-tune on a domain-mismatched base model typically lifts Recall@5 by 8–20 points and MRR@10 by a similar amount. If you see < 2 points improvement, something's wrong — usually your training pairs are too close to noise (LLM-generated without filtering) or your eval set overlaps with training data.
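A quick way to check the second failure mode (eval queries leaking into training data), assuming both sets are JSONL files in the format from Steps 2 and 3; eval_pairs.jsonl is a hypothetical filename:

import json

def load_queries(path):
    with open(path) as f:
        return {json.loads(line)["query"].strip().lower() for line in f}

train_q = load_queries("pairs.jsonl")
eval_q = load_queries("eval_pairs.jsonl")  # held-out file from Step 3

overlap = train_q & eval_q
print(f"{len(overlap)} of {len(eval_q)} eval queries also appear in training data")
# Anything above roughly zero exact matches means your eval numbers are inflated.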

Then do the eyeball test. Take the 10 queries your team has personally complained about and run them through both models:

queries = ["MRR churn cohort", "invite teammate sso", "webhook retry policy"]
for q in queries:
    q_emb = tuned.encode(q, normalize_embeddings=True)
    scores = (corpus_embeddings @ q_emb).tolist()
    top = sorted(zip(scores, corpus_titles), reverse=True)[:3]
    print(q, "→", top)

You should see results that actually look like answers a human would give. If MRR churn cohort still returns billing reconciliation, your pairs didn't cover that vocabulary. Add 20 more pairs that explicitly bridge those terms and retrain.

Step 6: Ship it without breaking your index

You changed embedding dimensions? No, you didn't — you fine-tuned the same architecture, so dim stays at 768. But the vectors are different. You must re-encode your entire corpus before swapping the model in production.

# Re-encode the corpus with the new model
corpus_embeddings = tuned.encode(
    list(corpus.values()),
    batch_size=64,
    normalize_embeddings=True,
    show_progress_bar=True,
)

# Write to your vector DB (pgvector, Qdrant, Pinecone, etc.)
# Use a new collection / namespace so you can A/B against the old one.
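To make that last comment concrete, here's a minimal sketch against Qdrant; the same pattern applies to pgvector or Pinecone, and the collection name, URL, and payload fields are assumptions:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

# New collection so the old index keeps serving while you A/B
client.create_collection(
    collection_name="kb_bge_saas_v2",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

client.upsert(
    collection_name="kb_bge_saas_v2",
    points=[
        PointStruct(id=i, vector=emb.tolist(), payload={"doc_id": did})
        for i, (did, emb) in enumerate(zip(corpus.keys(), corpus_embeddings))
    ],
)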

Run both indices in parallel for a week. Route 10% of traffic to the new one. Watch click-through rate on the top result and the zero-result rate. If both move in the right direction, ramp.
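The 10% split can be as simple as deterministically hashing a stable org or user ID, so the same customer always hits the same index; a minimal sketch:

import hashlib

def use_new_index(org_id: str, rollout_pct: int = 10) -> bool:
    """Deterministically route the same org to the same index on every request."""
    bucket = int(hashlib.sha256(org_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct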

Common errors and what they actually mean

Loss goes to zero in the first 50 steps

Your pairs are too easy — usually because positives are near-identical to queries (e.g., you used the doc title as both query and positive). Add harder pairs, or mine queries from real logs instead.

Loss is unstable, spikes randomly

Batch size too small, or learning rate too high. Increase batch size first: with MultipleNegativesRankingLoss, a bigger batch also means more in-batch negatives. If you're VRAM-limited, gradient_accumulation_steps=4 will smooth the gradients but won't add negatives; to get the large-batch negative pool on a small GPU, use the cached variant of the loss shown below.
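sentence-transformers ships a cached (GradCache-style) variant of the same loss for exactly this situation; a minimal sketch:

from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Behaves like MultipleNegativesRankingLoss with the full batch of negatives,
# but only mini_batch_size examples sit on the GPU at once.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)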

Eval metrics improve on test split but regress on production queries

Classic distribution drift. Your training pairs came from one source (say, support tickets) and your real queries look different (in-product search). Re-mine pairs from the source that matches production.

Out-of-memory error during evaluation

The InformationRetrievalEvaluator encodes the full corpus into memory. For corpora over ~50K docs, pass corpus_chunk_size=10000 to the evaluator.
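In code, that's one extra argument to the evaluator from Step 3:

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="saas-kb-eval",
    corpus_chunk_size=10_000,  # encode and score the corpus in 10K-doc chunks
)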

Model performs worse than the base model on general queries

You overfit. Mix in some general-domain pairs (e.g., 10–20% from MS MARCO) during training, or reduce epochs to 1–2. This is the real tradeoff of fine-tuning: you gain domain accuracy and lose some out-of-distribution generalization. For a closed-domain product search, that's usually a fair trade. For an open-domain assistant, it's not.
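A minimal sketch of the mixing, assuming you've exported some general-domain (query, positive) pairs, for example sampled from MS MARCO, to a general_pairs.jsonl in the same format as Step 2:

from datasets import load_dataset, concatenate_datasets

domain = load_dataset("json", data_files="pairs.jsonl", split="train")
general = load_dataset("json", data_files="general_pairs.jsonl", split="train")

# Roughly 10-20% general-domain pairs is enough to anchor the model
n_general = min(len(general), int(0.15 * len(domain)))
mixed = concatenate_datasets([domain, general.shuffle(seed=42).select(range(n_general))])
mixed = mixed.shuffle(seed=42)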

What this approach is bad at

Honest list, because you'll find these the hard way otherwise:

  1. Open-domain and general-knowledge queries. Fine-tuning trades out-of-distribution generalization for domain sharpness; if your assistant answers questions beyond your product, keep a general model in the loop or mix in general pairs as described above.
  2. Teams without real query signal. If every pair is LLM-generated and unfiltered, the gains shrink fast; quality dominates quantity (see the FAQ below).
  3. Ranking problems inside the top 20. If the right doc is retrieved but buried at position 12, that's a re-ranker's job, not the embedder's (also covered in the FAQ).

How CodeNicely can help

We built the RAG and search layer for GimBooks, a YC-backed accounting SaaS where the vocabulary problem was severe: small-business users in India describe invoicing in a mix of English, Hindi-English code-switching, and informal accounting terms (kacha bill, udhaar, party ledger) that no off-the-shelf embedding model handles cleanly. Generic embeddings missed roughly half the obvious queries. The fix was exactly the pipeline above: mining pairs from in-app search logs, fine-tuning a small open-source encoder, and shipping it behind an A/B test against the original index.

If you're sitting on query logs and complaints but don't have the team bandwidth to run the data-mining and evaluation loop, that's a problem we've solved on production traffic. Our AI Studio works with B2B SaaS teams on retrieval, ranking, and embedding-layer fixes — typically as an embedded pod rather than a one-off consultation. See how we work with scale-ups for the engagement model.

Frequently Asked Questions

How many training pairs do I actually need to fine-tune an embedding model?

500 high-quality pairs from real user queries will outperform 5,000 LLM-generated synthetic pairs. Quality dominates quantity. If you can get to 1,500–2,000 mined from logs and support tickets, you're in a comfortable zone for most B2B SaaS knowledge bases.

Should I fine-tune the embedding model or just add a re-ranker?

Re-rankers (Cohere Rerank, BGE reranker) help when retrieval surfaces the right doc in the top 20 but not the top 3. They don't help when retrieval misses the doc entirely — which is the vocabulary-mismatch problem. If your Recall@20 is already high, add a re-ranker. If it's not, fine-tune the embedder first.

Will a fine-tuned open-source model really beat OpenAI's text-embedding-3-large on my domain?

On your domain, frequently yes. On general benchmarks, no. text-embedding-3-large is broader; a fine-tuned bge-base is sharper on your specific vocabulary. The honest answer depends on how distinctive your domain language is. Build the eval set in Step 3 and measure — don't guess.

How often do I need to re-train the embedding model?

Trigger re-training when (a) you ship significant new product surface area with new terminology, or (b) your eval-set metrics drift more than 5 points from launch baseline. For most stable B2B SaaS products, once or twice a year is enough. For rapidly evolving products, quarterly.

What does it cost to do this end-to-end with CodeNicely?

It depends on the size of your corpus, the quality of your existing query logs, and whether you need ongoing re-training infrastructure or a one-time lift. Contact CodeNicely for a personalized assessment and we'll scope it against your actual data.

Building something in SaaS?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team