
Vector Search Is Not Semantic Search (And the Difference Costs You)

For: a Series A health-tech product lead who just shipped a symptom-search or drug-lookup feature on a vector database and is puzzled why clinically similar queries return irrelevant results even though the demo worked perfectly

Your demo worked. You typed "chest pain after exertion" and the top result was "angina pectoris." Investors clapped. Then a clinician on your advisory board typed "beta blocker contraindicated in asthma" and got back results about beta agonists used to treat asthma. The vector database is doing exactly what it was built to do. The problem is what you assumed it was doing.

Vector search and semantic search are not the same thing. Conflating them is cheap to do and expensive to discover. Here is what is actually happening under the hood, why general-purpose embeddings silently fail in healthcare, and what to change.

The problem vector search was built to solve

Keyword search (BM25, Elasticsearch defaults) breaks the moment users phrase things differently than your documents. A patient writes "my heart is racing." Your drug monograph says "tachycardia." Lexical overlap: zero. Result: nothing useful.
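
You can watch this failure in a few lines. A minimal sketch using the rank_bm25 package, with toy monograph snippets standing in for your corpus:

    # pip install rank-bm25
    from rank_bm25 import BM25Okapi

    # Illustrative monograph snippets, tokenized by whitespace.
    corpus = [
        "tachycardia may occur at higher doses",
        "bradycardia and fatigue are common with beta blockers",
    ]
    bm25 = BM25Okapi([doc.split() for doc in corpus])

    # The patient's phrasing shares no tokens with "tachycardia".
    scores = bm25.get_scores("my heart is racing".split())
    print(scores)  # [0. 0.]: zero lexical overlap, zero signal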

Vector search fixes this by representing text as a list of numbers — an embedding — produced by a neural network. Phrases with similar meaning end up geometrically close in this high-dimensional space. "Heart racing" and "tachycardia" land near each other. You store these vectors in a database (Pinecone, Weaviate, pgvector, Qdrant), and at query time you find the nearest neighbors.

That is vector search. It is a geometry problem. It is not a meaning problem.
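
Here is the whole mechanism in miniature, assuming OpenAI's text-embedding-3-small (the model used in the worked example below). A vector database runs the same nearest-neighbor step, just behind an index instead of brute force:

    # pip install openai numpy  (assumes OPENAI_API_KEY is set)
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts: list[str]) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    docs = ["tachycardia", "bradycardia", "migraine prophylaxis"]
    doc_vecs = embed(docs)
    q = embed(["my heart is racing"])[0]

    # Cosine similarity is just a normalized dot product: pure geometry.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    for doc, s in sorted(zip(docs, sims), key=lambda t: -t[1]):
        print(f"{s:.3f}  {doc}")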

One analogy, then we move on

Imagine a librarian who has never read any of the books but has memorized which books tend to be checked out together. Ask for "something like Sapiens" and they will hand you adjacent books on the shelf. Mostly useful. Occasionally they hand you a critique of Sapiens, because critiques sit next to the original in checkout patterns. The librarian is not wrong about geometry. They are just not reasoning about your intent.

That is your embedding model. "Close" is defined by whatever the model was trained to predict — usually next-token likelihood or sentence pair similarity on web text. Not clinical equivalence.

A minimal worked example

Suppose you index three drug descriptions using OpenAI's text-embedding-3-small:

  1. Propranolol — a beta blocker used for hypertension and migraine prophylaxis. Contraindicated in asthma.
  2. Albuterol — a beta-2 agonist used as a rescue inhaler in asthma.
  3. Metoprolol — a cardioselective beta blocker used post-MI.

Query: "safe beta blocker for a patient with mild asthma".

A general-purpose embedding sees "beta," "asthma," "patient," "safe." Albuterol's description shares "beta" and "asthma" — strong cosine similarity. Metoprolol shares "beta blocker" but not "asthma." Depending on the index, albuterol can rank above metoprolol. The geometrically nearest answer is clinically dangerous.

The model never learned that "beta blocker" and "beta agonist" are mechanistically opposite. To it, they are two tokens that often co-occur with cardiopulmonary text.
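
You can reproduce the experiment yourself, reusing the embed() helper from the sketch above. The exact ranking depends on the model version, so treat this as a harness for checking your own stack rather than a guaranteed result:

    # Reuses embed() and numpy from the earlier sketch.
    drugs = {
        "propranolol": "a beta blocker used for hypertension and migraine "
                       "prophylaxis; contraindicated in asthma",
        "albuterol": "a beta-2 agonist used as a rescue inhaler in asthma",
        "metoprolol": "a cardioselective beta blocker used post-MI",
    }
    vecs = embed(list(drugs.values()))
    q = embed(["safe beta blocker for a patient with mild asthma"])[0]

    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    for name, s in sorted(zip(drugs, sims), key=lambda t: -t[1]):
        print(f"{s:.3f}  {name}")
    # If albuterol outranks metoprolol, you have reproduced the failure:
    # shared vocabulary ("beta", "asthma") beat clinical correctness.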

Where the failure actually lives

When your search returns plausible-but-wrong results, debug in this order:

1. The embedding model

This is the failure point 80% of the time in healthcare. General models (OpenAI, Cohere, all-MiniLM) are trained on web text. They encode that "aspirin" and "ibuprofen" are similar — both NSAID-ish things people Google. They do not reliably encode that one is irreversible COX inhibition and one is reversible, which matters for surgical bleeding risk.

Fix: evaluate domain-tuned embeddings. BioBERT, ClinicalBERT, SapBERT, MedCPT, or BGE fine-tuned on UMLS pairs. Run them against a held-out set of clinically tricky pairs you wrote — not a public benchmark.
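
A minimal sketch of that head-to-head with sentence-transformers. The model names are examples, the 0.8 cutoff is arbitrary until you calibrate it, and research checkpoints like SapBERT may expect different pooling than the library's default:

    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer, util

    # Clinician-reviewed triples: (text_a, text_b, should_be_close).
    tricky_pairs = [
        ("beta blocker", "beta agonist", False),  # opposite mechanism
        ("heart racing", "tachycardia", True),    # true paraphrase
        ("hyperkalemia", "hypokalemia", False),   # look-alike opposites
    ]

    def fraction_correct(model_name: str, cutoff: float = 0.8) -> float:
        model = SentenceTransformer(model_name)
        hits = 0
        for a, b, should_be_close in tricky_pairs:
            sim = util.cos_sim(model.encode(a), model.encode(b)).item()
            hits += (sim >= cutoff) == should_be_close
        return hits / len(tricky_pairs)

    for name in ["all-MiniLM-L6-v2",  # general-purpose baseline
                 "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"]:  # domain-tuned
        print(name, fraction_correct(name))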

2. The chunking strategy

If you embedded an entire drug monograph as one vector, you have averaged contraindications, dosing, and pharmacology into a single point. The vector represents the document's gist, not its specifics. Re-chunk by section. Embed contraindications separately from indications.
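
A sketch of section-level chunking, assuming a hypothetical monograph dict; real sources such as DailyMed SPL need a proper parser, but the shape is the same:

    # Hypothetical monograph structure for one drug.
    monograph = {
        "indications": "hypertension; migraine prophylaxis",
        "contraindications": "asthma; severe bradycardia",
        "dosing": "40 mg twice daily, titrated to response",
    }

    # One chunk per section, tagged so the section survives as metadata.
    chunks = [
        {"drug": "propranolol", "section": section, "text": text}
        for section, text in monograph.items()
    ]
    # Embed chunk["text"] and store "section" alongside the vector, so a
    # query like "is X contraindicated in Y" can be filtered to the
    # contraindications index instead of matching the document's gist.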

3. The approximate nearest neighbor index

HNSW, IVF, and other approximate nearest neighbor algorithms trade recall for speed. Default settings (e.g. pgvector's hnsw.ef_search of 40) are tuned for general workloads. On a small corpus of high-stakes data, run exact search and compare. If the approximate index misses the right answer that exact search finds, raise ef_search or switch to flat indexing. You probably do not have billions of vectors. Stop pretending you do.
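
A quick offline version of that comparison using FAISS, with random vectors standing in for your embeddings:

    # pip install faiss-cpu numpy
    import faiss
    import numpy as np

    d, n = 384, 10_000  # toy scale; your corpus is probably similar
    vecs = np.random.rand(n, d).astype("float32")     # stand-in embeddings
    queries = np.random.rand(100, d).astype("float32")

    exact = faiss.IndexFlatL2(d)   # brute force: the ground truth
    exact.add(vecs)
    _, truth = exact.search(queries, 10)

    ann = faiss.IndexHNSWFlat(d, 32)   # M=32 graph connectivity
    ann.add(vecs)
    ann.hnsw.efSearch = 40             # the kind of default worth questioning
    _, approx = ann.search(queries, 10)

    recall = np.mean([len(set(t) & set(a)) / 10 for t, a in zip(truth, approx)])
    print(f"recall@10 vs exact search: {recall:.2%}")
    # If this falls short of ~100% at this scale, raise efSearch or go flat.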

4. The query itself

User queries are short and ambiguous. "BP meds" embeds nothing like "antihypertensive agents." Add a query rewriting step — a small LLM call that expands and normalizes the query before embedding. This is where retrieval-augmented generation pipelines earn their keep.
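
A minimal sketch of the rewrite step with the OpenAI chat API; the model and prompt are illustrative, not a recommendation:

    from openai import OpenAI

    client = OpenAI()

    def rewrite_query(raw_query: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system", "content": (
                    "Rewrite the user's medical search query into precise "
                    "clinical terminology. Expand abbreviations. Return only "
                    "the rewritten query.")},
                {"role": "user", "content": raw_query},
            ],
        )
        return resp.choices[0].message.content.strip()

    # "BP meds" -> something like "antihypertensive medications",
    # which now embeds near your monograph language.
    print(rewrite_query("BP meds"))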

5. The assumption that one score is enough

Cosine similarity gives you a ranked list. It does not give you a confidence threshold. Pair vector search with a reranker (Cohere Rerank, BGE reranker, or a cross-encoder) that actually reads query and candidate together. Rerankers are slower per pair but you only run them on the top 50 candidates. Accuracy gains in clinical retrieval are usually large.
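
A sketch using a public cross-encoder via sentence-transformers, with stand-in candidates where your top-50 ANN results would go:

    # pip install sentence-transformers
    from sentence_transformers import CrossEncoder

    # A public general-purpose reranker; a stand-in for the domain- or
    # pharmacist-tuned model you would actually ship.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "safe beta blocker for a patient with mild asthma"
    candidates = [  # stand-in for your top-50 ANN results
        "albuterol: a beta-2 agonist used as a rescue inhaler in asthma",
        "metoprolol: a cardioselective beta blocker used post-MI",
    ]

    # The cross-encoder reads query and candidate together, so it can
    # penalize agonist/antagonist confusions that cosine similarity rewards.
    scores = reranker.predict([(query, c) for c in candidates])
    reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]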


When to use vector search — and when not to

Use it when: queries are natural-language and varied, the corpus is large enough that exact lexical matching fails, and the cost of a near-miss is acceptable (FAQ search, document discovery, internal knowledge bases).

Do not use it alone when: wrong answers cause clinical harm, the domain has terms with opposite meanings that look similar (agonist/antagonist, hyper-/hypo-, -emia/-uria), the corpus is small enough for hybrid retrieval, or you need explainable ranking for regulatory review.

For most health-tech features, the right architecture is hybrid: BM25 for precision on known terms, domain-tuned vectors for recall on paraphrase, a reranker on top, and a guardrail layer that catches contraindication-class errors before they reach the user.
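
One common way to merge the BM25 and vector lists before reranking is reciprocal rank fusion; it is a standard choice, not the only one. A sketch with stand-in rankings:

    def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
        """Merge ranked doc-ID lists; k=60 follows Cormack et al. (2009)."""
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_ids = ["metoprolol", "propranolol"]   # stand-in BM25 ranking
    vector_ids = ["albuterol", "metoprolol"]   # stand-in vector ranking
    fused = reciprocal_rank_fusion([bm25_ids, vector_ids])
    # Feed the fused top-N to the reranker, then the guardrail layer.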

How CodeNicely can help

We built HealthPotli, an e-pharmacy with AI-driven drug interaction and substitution logic. The retrieval problem you are describing is exactly what we worked through there: domain-tuned embeddings for drug names and active ingredients, separate indexes for indications versus contraindications, a reranker tuned on pharmacist-labeled pairs, and a hard rule layer for known dangerous combinations that retrieval is never trusted to catch alone. If your team has shipped v1 on a generic embedding stack and is now seeing edge cases that worry your medical advisors, that is the engagement we know how to run. Our AI studio handles the embedding evaluation, retrieval pipeline, and clinical eval harness as one workstream rather than three handoffs.

Frequently Asked Questions

What is the difference between vector search and semantic search?

Vector search is a mechanism: it finds nearest neighbors in embedding space. Semantic search is a goal: returning results that match user intent. Vector search is one way to attempt semantic search, and it works well when the embedding model has learned the distinctions your domain cares about. In specialized fields like healthcare or law, that assumption frequently breaks.

Why does my vector search return clinically wrong results despite high similarity scores?

High cosine similarity means the embedding model thinks two pieces of text are related — usually because they share vocabulary, topic, or co-occurrence patterns in training data. It does not mean they are clinically equivalent. General-purpose models routinely score "beta blocker" and "beta agonist" as highly similar because they appear in similar contexts, even though their pharmacology is opposite.

Should I use a domain-specific embedding model like BioBERT or ClinicalBERT?

Almost always yes for clinical retrieval, but verify on your own evaluation set. Public benchmarks like BEIR or MTEB do not capture the specific failure modes that matter for your product. Build 100–300 query-answer pairs reviewed by a clinician, then compare general and domain models head to head.

Is hybrid search (BM25 + vectors) better than pure vector search?

For most healthcare and regulated-industry use cases, yes. BM25 catches exact term matches that vectors sometimes miss (drug names, ICD codes, specific dosages), and vectors catch paraphrases that BM25 misses. A reranker on top of the merged candidate set typically outperforms either approach alone.

How do I evaluate whether my embedding search is accurate enough to ship?

Build a labeled evaluation set with your clinical advisors before you tune anything. Measure recall@10 and MRR on queries that represent real user intent, including adversarial cases (negation, contraindications, look-alike drugs). If you want help designing that harness for a regulated product, contact CodeNicely for a personalized assessment.
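
Both metrics fit in a few lines. A sketch assuming each query yields a ranked list of doc IDs and a clinician-labeled set of relevant IDs:

    def recall_at_k(ranked: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
        hits = [bool(set(r[:k]) & rel) for r, rel in zip(ranked, relevant)]
        return sum(hits) / len(hits)

    def mrr(ranked: list[list[str]], relevant: list[set[str]]) -> float:
        total = 0.0
        for r, rel in zip(ranked, relevant):
            for pos, doc_id in enumerate(r, start=1):
                if doc_id in rel:
                    total += 1.0 / pos
                    break
        return total / len(ranked)

    # One query, with "metoprolol" labeled relevant by your clinicians:
    print(recall_at_k([["metoprolol", "albuterol"]], [{"metoprolol"}]))  # 1.0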

Building something in Healthcare?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team