Retrieval-Augmented Generation: What It Is and When It Breaks
For: A Series A SaaS founder who just watched a demo of a competitor's AI-powered knowledge assistant and is now asking their CTO whether to build the same thing — they have heard "RAG" three times in the last week and still cannot explain to their board what it actually does or why it sometimes gives confident wrong answers
Your competitor shipped an AI assistant that answers customer questions from their docs. Your board saw the demo. Now you are in a Slack thread with your CTO trying to decide if you should build the same thing, and someone keeps saying "RAG" like it settles the argument. It does not. RAG is a specific architectural choice with specific failure modes, and most teams that ship it discover the failure modes in production instead of in design review.
This post explains retrieval-augmented generation in the terms you actually need to make a build decision: what problem it solves, how it works, where it breaks, and when you should pick something else.
The problem RAG solves
Large language models know what was in their training data. They do not know your company's refund policy, last week's release notes, or the support ticket a customer filed yesterday. If you ask GPT-4 about your internal pricing tiers, it will either say it does not know or — worse — guess plausibly.
You have two ways to fix this. You can retrain or fine-tune the model on your data, which is expensive, slow to update, and bakes information into weights you cannot easily audit. Or you can do something simpler: before the model answers, fetch the relevant documents and paste them into the prompt. That second approach is RAG.
The whole architecture is in the name. Retrieval: find the right snippets. Augmented: stuff them into the context window. Generation: let the LLM write an answer grounded in what you just gave it.
An analogy that actually holds up
Think of an LLM as a smart but overconfident new hire. They have read a lot, but they have not read your wiki. Fine-tuning is sending them to a six-month training program on your company. RAG is handing them the relevant Confluence page two seconds before they walk into the meeting.
The catch: if you hand them the wrong page, they will read it confidently and answer wrong. They will not say "this does not look right." That is the part most explanations skip, and it is the part that matters.
How RAG actually works, minimally
A working RAG pipeline has four stages. Strip away the jargon and it looks like this:
- Chunk and embed your documents. Split your knowledge base into passages (a few hundred tokens each). Run each chunk through an embedding model — OpenAI's `text-embedding-3`, Cohere's `embed-v3`, or an open model like `bge-large` — which converts text into a vector of numbers. Store those vectors in a database like Pinecone, Weaviate, pgvector, or Qdrant.
- Embed the user's question with the same model. You now have a vector for the question.
- Retrieve. Run a similarity search — usually cosine similarity — to find the top K chunks whose vectors are closest to the question vector. K is typically 3 to 10.
- Generate. Build a prompt that says something like: "Answer the question using only the context below. Context: [chunks]. Question: [user input]." Send it to the LLM.
That is it. Everything else — reranking, hybrid search, query rewriting, metadata filtering — is optimization on top of those four steps.
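To make those four steps concrete, here is a minimal sketch in Python. It is illustrative, not production code: it assumes the OpenAI Python SDK (with an API key in the environment), keeps vectors in a numpy array instead of a vector database, and uses a three-chunk placeholder corpus where your real knowledge base would go.

```python
# Minimal RAG sketch: in-memory vectors, no vector DB, no error handling.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY set in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Embed a list of strings into unit-length vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# 1. Chunk and embed your documents (placeholder chunks stand in for real docs).
chunks = [
    "Refunds are available within 30 days of purchase on all plans.",
    "Enterprise plans include SSO, audit logs, and a dedicated CSM.",
    "Subscriptions renew monthly and can be cancelled from the billing page.",
]
chunk_vecs = embed(chunks)

# 2. Embed the user's question with the same model.
question = "How do I cancel my subscription?"
q_vec = embed([question])[0]

# 3. Retrieve: cosine similarity is a dot product on normalized vectors;
#    keep the top K chunks.
K = 2
scores = chunk_vecs @ q_vec
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:K]]

# 4. Generate: ground the answer in the retrieved context.
prompt = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you do not know.\n\n"
    "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Everything your team will argue about later (chunk size, K, the embedding model, the prompt wording) lives in those forty lines.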
Where it breaks in production
Here is the insight most teams learn the painful way: RAG does not make an LLM more accurate. It shifts the failure mode from hallucination to retrieval error.
Without RAG, a model with no relevant knowledge will either refuse to answer or make something up. With RAG, the model has been handed three documents and told they are the answer. If those documents are wrong, stale, or off-topic, the model will confidently synthesize an answer from them. The user has no way to tell the difference between a grounded answer and a fluently wrong one.
The common failure patterns:
- Retrieval returns the wrong chunks. Embedding models capture semantic similarity, not factual relevance. "How do I cancel my subscription?" might retrieve a marketing page about plan benefits because it talks about subscriptions a lot.
- The right answer is split across chunks. Your chunk size is 500 tokens. The answer requires combining three paragraphs that landed in different chunks. Retrieval grabs one. The model answers from a third of the truth.
- Stale data. Someone updated the pricing page. The vector index still has last quarter's version. The model cites confidently.
- Adversarial or ambiguous queries. "What is the refund policy for enterprise customers in Germany after July 2024?" requires filtering, not just similarity. Pure vector search will return whatever is vaguely about refunds.
- The model ignores the context. Yes, this happens. If the retrieved chunks contradict what the LLM was trained on, it sometimes hallucinates the trained answer anyway. Prompt engineering helps. It does not eliminate it.
- Citation drift. The model cites source A but actually paraphrases source B. Users who trust the citation are now misled with a footnote.
None of these are exotic edge cases. They are the median behavior of an unoptimized RAG system on real user queries. Teams that build production RAG without an evaluation harness — a fixed set of questions with known-good answers, run on every change — find out about these failures from customer complaints.
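An evaluation harness can start embarrassingly small. The sketch below assumes hypothetical retrieve() and generate() functions wrapping your own pipeline, and uses crude substring matching as the pass/fail check; real harnesses often use labeled relevance judgments or an LLM judge, but even this version separates retrieval failures from generation failures on every change.

```python
# Crude eval harness sketch. retrieve() and generate() are hypothetical
# wrappers around your pipeline; the eval cases here are placeholders.
EVAL_SET = [
    {"question": "How long is the refund window?", "must_contain": "30 days"},
    {"question": "Can I cancel from the app?", "must_contain": "billing page"},
]

def run_evals(retrieve, generate):
    failures = []
    for case in EVAL_SET:
        chunks = retrieve(case["question"])          # list of retrieved texts
        answer = generate(case["question"], chunks)  # generated answer string
        # Score the two stages separately: a miss here tells you whether to
        # fix retrieval or fix the prompt.
        retrieved_ok = any(case["must_contain"] in c for c in chunks)
        answered_ok = case["must_contain"] in answer
        if not (retrieved_ok and answered_ok):
            failures.append((case["question"], retrieved_ok, answered_ok))
    for q, r_ok, a_ok in failures:
        print(f"FAIL {q!r}: retrieval={'ok' if r_ok else 'MISS'}, "
              f"answer={'ok' if a_ok else 'MISS'}")
    return not failures
```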
RAG vs fine-tuning: when each one wins
The framing "RAG vs fine-tuning" is mostly wrong. They solve different problems.
| Use RAG when | Use fine-tuning when |
|---|---|
| The knowledge changes frequently | You need to change the model's style, tone, or format |
| You need citations and auditability | You need to teach a new task or reasoning pattern |
| Your data is large and structured (docs, tickets, code) | Your data is small but you need it baked in |
| You need to update knowledge without retraining | Latency is critical and you cannot afford retrieval overhead |
In practice, serious systems combine them: fine-tune the model on your domain's vocabulary and response format, then use RAG to inject current facts at query time. Most Series A teams should start with RAG only. Fine-tuning is a second-year problem.
When you should not use RAG at all
RAG is the default answer in 2024, which means it is also the wrong answer more often than people admit. Skip it if:
- Your corpus is small enough to fit in the context window. Modern models handle 100k+ tokens. If your entire knowledge base is 50 pages, just paste it in. You will get better answers with less infrastructure.
- Your queries are structured. "Show me all invoices over $10k from Q3" is a SQL query, not a vector search. Do not embed your way out of a database problem.
- You need deterministic answers. Compliance disclosures, legal language, medical dosages — anywhere a wrong-but-plausible answer is dangerous, RAG adds risk. Use templated responses with the LLM only for formatting, if at all.
- Your users would be better served by search. Sometimes people want to read the source. A good search box with snippets beats a chatbot that summarizes badly.
We have seen this pattern repeatedly in production work — for example in healthcare contexts like the HealthPotli drug interaction system, where deterministic rule-based logic handles the parts where being wrong is unacceptable, and AI handles the parts where it adds genuine value. The architecture decision is usually "where does each technique belong" rather than "which one wins."
What a serious RAG build actually requires
If you decide to build, the parts that matter are not the ones in the tutorials:
- An eval set. 50-200 real questions with known answers. Run it on every change. Without this you are flying blind.
- Hybrid retrieval. Combine vector search with keyword search (BM25). Pure vector misses exact-match queries like product names and error codes. A minimal fusion sketch follows this list.
- A reranker. A second-stage model (Cohere Rerank, BGE reranker) that re-scores the top 20 results down to top 5. This is the highest-leverage single improvement in most RAG systems.
- Metadata filtering. Tag chunks with source, date, customer tier, product. Filter before similarity search.
- Observability. Log every query, every retrieved chunk, every generated answer. You cannot debug what you cannot see.
- A way to say "I do not know." Prompt the model to refuse when the retrieved context does not support an answer. Then measure how often it actually refuses when it should.
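To make the hybrid retrieval item concrete: the standard way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below assumes you already have two ranked lists of chunk IDs, one from BM25 (from the rank_bm25 package or your search engine) and one from vector similarity; k=60 is the conventional constant from the original RRF paper.

```python
# Reciprocal rank fusion: merge keyword and vector rankings into one list.
# Assumes bm25_ids and vector_ids are chunk IDs, best-first, from each method.
def rrf_merge(bm25_ids, vector_ids, k=60, top_n=5):
    scores = {}
    for ranking in (bm25_ids, vector_ids):
        for rank, chunk_id in enumerate(ranking):
            # Chunks near the top of either list score high; chunks present
            # in both lists accumulate both contributions.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A query like "error E1042" ranks high in BM25 on the exact token even when
# the embedding model has never seen that error code.
print(rrf_merge(
    bm25_ids=["c7", "c2", "c9"],    # keyword hits, best first
    vector_ids=["c2", "c4", "c7"],  # semantic hits, best first
))  # chunks found by both methods ("c2", "c7") float to the top
```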
This is the boring infrastructure work that separates a demo from a product. It is also where most internal RAG projects stall. If you are weighing whether to staff this in-house or get help, our AI Studio page outlines how we typically structure these builds.
The bottom line for your board
RAG is a real technique that solves a real problem: giving an LLM access to information it was not trained on. It is not magic, it is not a moat, and it is not a substitute for thinking carefully about where AI belongs in your product.
The right question is not "should we add RAG?" It is "what specific user job is failing today, and is retrieval-augmented generation the cheapest reliable way to fix it?" If you cannot answer the second question in two sentences, you are not ready to build it yet.
Frequently Asked Questions
Is RAG better than fine-tuning for a startup?
For most Series A teams, yes — start with RAG. Fine-tuning requires labeled data, ML expertise, and a retraining pipeline. RAG lets you update knowledge by re-indexing documents. Fine-tuning becomes useful later, usually for style and format control, not for facts.
Why does RAG sometimes give confident wrong answers?
Because the LLM trusts whatever you put in its context. If retrieval returns the wrong document, the model will fluently generate an answer based on it. The fix is not better prompts — it is better retrieval, reranking, and an evaluation harness that catches these failures before users do.
What is the difference between RAG and vector search?
Vector search is one component of RAG. Vector search finds similar documents. RAG uses that retrieval step plus an LLM to generate a natural-language answer grounded in the retrieved documents. You can have vector search without RAG (just search) but not RAG without some form of retrieval.
How much does it cost to build a production RAG system?
It depends entirely on data volume, query patterns, latency requirements, and how much custom retrieval logic you need. The infrastructure pieces are cheap; the engineering to make it reliable is not. Contact CodeNicely for a personalized assessment based on your specific use case.
Can RAG replace a search bar in our product?
Sometimes, but not always. If users want to find a document, a good search interface beats a chatbot. If users want an answer synthesized from multiple documents, RAG wins. The honest answer is most products need both — search for browsing, RAG for question-answering — and the design question is which one is the front door.
Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.