Startups • AI/ML • May 3, 2026 • 8 min read

Your RAG Pipeline Isn't Failing. Your Chunking Strategy Is.

For: A senior engineer at a Series A SaaS startup who shipped a RAG-based knowledge assistant three months ago — it demo'd beautifully but users keep reporting confidently wrong answers, and she has already swapped embedding models twice without improvement

If you've shipped a RAG assistant that demos beautifully and then quietly hallucinates in production, I'd bet against your embedding model being the problem. I'd bet against your re-ranker too. The defect was almost certainly introduced before any of those components saw the data — at the moment you decided how to split your documents into chunks. Chunking is the only step in a RAG pipeline where a structural mistake is completely invisible to every downstream metric. Embeddings look healthy. Retrieval scores look healthy. The LLM sounds confident. And the answer is still wrong, because the context window was handed semantically amputated fragments that no model can reason over correctly.

This is the position I'll defend: chunking is the highest-leverage and most under-engineered step in production RAG. If you've already swapped embedding models twice without improvement, stop tuning retrieval. Go look at what your ingestion pipeline actually produced.

Why chunking failures hide so well

Every other failure mode in a RAG pipeline leaves fingerprints. A bad embedding model produces low cosine similarity for obviously related queries — you can spot it on a held-out set in an afternoon. A bad re-ranker shows up as relevant chunks ranked below junk. A bad prompt produces format violations or refusals you can grep for.

Chunking failures produce none of these signals. Consider what happens when you split a 40-page policy document with a fixed 512-token window and 50-token overlap: the blanket rule ("refunds are issued in full within 30 days") lands at the end of one chunk, and the clause that carves out international orders lands at the start of the next. A refund question retrieves the first chunk, and the model answers confidently without the exception.

Nothing in your observability stack flags this. Your retrieval recall@k is fine. Your faithfulness score is fine — the LLM did faithfully use the retrieved chunk. The chunk just wasn't the whole truth. This is the core reason why RAG gives wrong answers in production while every dashboard says it's working.
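
Here is a minimal sketch of that failure mode. The policy text, the character-based window, and the sizes are invented stand-ins for a real token-based splitter; the point is where the boundary lands.

```python
# Minimal illustration of a fixed-window split through a policy paragraph.
# The text and sizes are invented; the failure pattern is the point.

policy = (
    "Refunds are issued in full within 30 days of purchase. "
    "This does not apply to international orders, which are "
    "refunded as store credit only after a manual review."
)

def fixed_window(text: str, size: int, overlap: int) -> list[str]:
    """Naive character-window splitter, the kind most tutorials start with."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

for i, chunk in enumerate(fixed_window(policy, size=60, overlap=10)):
    print(i, repr(chunk))

# Chunk 0 carries the blanket refund promise; the international-order exception
# lands in later chunks, cut mid-word. A refund query can retrieve chunk 0 on
# its own, and the model answers confidently without the exception.
```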

The contrarian claim: most RAG failures are pre-retrieval failures

The popular debugging path for RAG looks like this: swap the embedding model, add a re-ranker, tune top-k, add a query rewriter, add HyDE, fine-tune the embedding model on domain data. Teams burn weeks on this loop. Sometimes it helps. Usually it doesn't, because none of those changes can reconstruct information that was destroyed at ingestion.

If your chunk boundary cut a conditional clause off from its consequent, no embedding model on earth can retrieve a chunk that contains both. The information isn't in any chunk anymore. You're tuning a search engine over a corpus that no longer contains the answer in any single retrievable unit.

This is what makes chunking different from every other knob in the pipeline. Retrieval, re-ranking, and prompting are all reversible — you can change them tomorrow and re-run. Chunking is baked into the index. Fixing it means re-ingesting the corpus, which teams resist because it feels like throwing away work.

Three concrete failure patterns I keep seeing

1. The fixed-window split through a table

A SaaS company indexed their pricing documentation with a 1000-character splitter. Half their pricing tables got cut between the header row and the data rows. The data-row chunks embedded as columns of numbers with no context. Queries about pricing returned plausible-looking number soup. The fix wasn't a better embedding model — it was detecting tables during ingestion and keeping each table as an atomic chunk with its caption attached.
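
A sketch of that fix, assuming the docs are Markdown with pipe-delimited tables and a caption line directly above each table; the prose_splitter argument stands in for whatever splitter you already use for ordinary text.

```python
import re

# Sketch: detect pipe-delimited Markdown tables during ingestion and emit each
# one as a single atomic chunk, with the line directly above it kept as caption.
TABLE_RE = re.compile(
    r"(?:^.*\n)?"             # optional caption line just above the table
    r"(?:^\|.*\|[ \t]*\n)+",  # one or more pipe-delimited rows
    re.MULTILINE,
)

def split_with_atomic_tables(doc: str, prose_splitter) -> list[str]:
    chunks, cursor = [], 0
    for match in TABLE_RE.finditer(doc):
        # Prose before the table goes through the normal splitter.
        chunks.extend(prose_splitter(doc[cursor:match.start()]))
        # The table (caption included) becomes one chunk, even if it
        # exceeds the usual token budget.
        chunks.append(match.group(0))
        cursor = match.end()
    chunks.extend(prose_splitter(doc[cursor:]))
    return [c for c in chunks if c.strip()]
```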

2. The Markdown-naive split of technical docs

Engineering teams love feeding their docs site directly into a generic text splitter. The splitter doesn't know that an h3 heading governs the next 800 words, and it cuts mid-section. Now you have orphan paragraphs that reference "this method" or "the above configuration" with no anchor. The LLM, handed these orphans, invents what the referent must have been. The fix is a structure-aware splitter that respects heading hierarchy and prepends parent headings to every child chunk.
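
A sketch of that structure-aware splitter, assuming ATX-style Markdown headings. It treats each section body as one chunk for brevity; a real version would still sub-split long sections under a token cap, and off-the-shelf header-aware splitters in the common RAG frameworks do roughly the same thing.

```python
# Sketch of a heading-aware Markdown splitter that prepends the heading path
# ("breadcrumb") to every chunk so orphan paragraphs keep their anchor.

def split_markdown_with_breadcrumbs(doc: str) -> list[str]:
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            breadcrumb = " > ".join(path)
            chunks.append(f"[{breadcrumb}]\n{text}" if breadcrumb else text)
        body.clear()

    for line in doc.splitlines():
        if line.startswith("#"):
            flush()                                   # close the previous section
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]                      # drop deeper/equal headings
            path.append(line.lstrip("#").strip())
        else:
            body.append(line)
    flush()
    return chunks
```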

3. The conversation-blind split of support transcripts

Customer support transcripts get chunked by token count. The customer's question ends up in chunk 7. The agent's resolution ends up in chunk 8. Retrieval pulls chunk 7 for a similar future question, and the LLM has no resolution to ground its answer in. The fix is to chunk by conversational turn pair, or by ticket, not by tokens.
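
A sketch of turn-pair chunking, assuming the transcript is already parsed into messages with a role field and that tickets are processed one at a time; the field names are illustrative.

```python
# Sketch: chunk a support transcript by question/resolution pair rather than
# by token count. Assumes each message is a dict with "role" ("customer" or
# "agent") and "text"; the ticket id rides along as metadata.

def chunk_by_turn_pair(ticket_id: str, messages: list[dict]) -> list[dict]:
    chunks, pending = [], []
    for msg in messages:
        if msg["role"] == "customer":
            pending.append(msg["text"])
        else:
            # The agent turn closes the pair, so question and resolution
            # land in the same retrievable unit.
            chunks.append({
                "ticket_id": ticket_id,
                "text": "Customer: " + " ".join(pending) + "\nAgent: " + msg["text"],
            })
            pending = []
    if pending:  # an unresolved trailing question still gets indexed
        chunks.append({"ticket_id": ticket_id, "text": "Customer: " + " ".join(pending)})
    return chunks
```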

The pattern across all three: the right unit of chunking is determined by the document's semantic structure, not by an arbitrary token budget. A document chunking strategy for LLMs needs a schema per document type, not a global default.

What a serious chunking strategy actually looks like

A defensible RAG chunking strategy has four properties:

  1. Structure-aware splitting. Use the document's native boundaries — headings, sections, list items, table rows, conversational turns. Token counts are a constraint, not the primary axis.
  2. Context propagation. Every chunk carries its parents. A chunk under Section 4.2 > Refund Policy > International Orders should embed and retrieve with that breadcrumb attached, not as a naked paragraph.
  3. Atomic semantic units stay atomic. Tables, code blocks, definitions, and conditional rules are never split mid-unit, even if they exceed your token budget. Better to have one oversized chunk than two useless halves.
  4. Per-document-type schemas. A legal contract chunks differently from a runbook. Build a small registry of chunkers, one per document class, and route on ingestion (a minimal sketch of the routing follows this list).
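
A sketch of that registry. The chunker bodies are placeholders for structure-aware implementations like the sketches above, and the classify heuristic is invented for illustration; route on whatever signals your corpus actually exposes (file extension, source system, MIME type).

```python
from typing import Callable

Chunker = Callable[[str], list[str]]

# Placeholder chunkers; each stands in for a structure-aware implementation
# like the heading-, table-, and transcript-aware sketches above.
def chunk_markdown(doc: str) -> list[str]:
    return [doc]

def chunk_transcript(doc: str) -> list[str]:
    return [doc]

def chunk_contract(doc: str) -> list[str]:
    return [doc]

CHUNKER_REGISTRY: dict[str, Chunker] = {
    "markdown": chunk_markdown,
    "transcript": chunk_transcript,
    "contract": chunk_contract,
}

def classify(filename: str, doc: str) -> str:
    """Cheap routing heuristic; swap in whatever signals your corpus exposes."""
    if filename.endswith((".md", ".mdx")):
        return "markdown"
    if "Customer:" in doc and "Agent:" in doc:
        return "transcript"
    return "contract"

def ingest(filename: str, doc: str) -> list[str]:
    return CHUNKER_REGISTRY[classify(filename, doc)](doc)
```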

The expensive part isn't writing the chunkers. It's building the evaluation harness that proves they work — a labeled set of question/answer pairs where you can verify the answer span lives inside a single retrieved chunk, with its qualifying context intact.
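
A sketch of what that harness boils down to, with invented labels: each case records the answer span plus the qualifying context that must land in the same chunk, and the metric is simply the fraction of cases where ingestion kept them together.

```python
# Sketch of the chunk-level check the harness runs. The labels are invented:
# each case records the answer span and the qualifying context that must
# survive ingestion inside one chunk.

LABELED_SET = [
    {
        "question": "Can international orders get a cash refund?",
        "answer_span": "refunded as store credit only",
        "qualifier": "international orders",
    },
    # ...more cases, ideally drawn from questions real users got wrong
]

def answer_intact(chunks: list[str], case: dict) -> bool:
    return any(case["answer_span"] in c and case["qualifier"] in c for c in chunks)

def intactness_rate(chunks: list[str]) -> float:
    """Fraction of labeled cases whose answer survived chunking in one piece."""
    return sum(answer_intact(chunks, case) for case in LABELED_SET) / len(LABELED_SET)
```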

The strongest counter-argument

The honest objection to all of this: structure-aware chunking is more engineering work, and for a corpus of clean, uniform documents, a naive recursive splitter often does fine. If your data is all Confluence pages of similar shape, or all PDFs of the same template, you can probably tune your way to acceptable quality without per-type schemas.

That's true. The argument breaks down the moment your corpus is heterogeneous — which is roughly the moment your product crosses from demo to production. Mixed PDFs, scraped HTML, transcripts, spreadsheets, and Notion exports cannot share one chunker. If you're shipping a knowledge assistant to real users with real documents, you're already past the point where naive splitting is defensible.

The other fair counter: better long-context models reduce the chunking problem. Partly true. But long context doesn't fix retrieval — you still have to decide what to put in the window, and you still pay latency and cost for what you stuff in. Chunking quality determines what's retrievable, which is upstream of context length entirely.

How CodeNicely can help

We've debugged this exact failure mode on production RAG systems. The closest reference is our work with HealthPotli, where the assistant had to reason over drug interaction data, dosage rules, and contraindications — a corpus where chunking errors aren't just embarrassing, they're dangerous. A naive split of a contraindication paragraph from its conditions would produce exactly the confidently-wrong answers your users are reporting. The engagement involved building document-type-specific chunkers, attaching structural metadata to every chunk, and an eval harness that tested whether the answer span and its qualifying clauses survived ingestion intact.

If your team has already swapped embedding models twice and the symptoms haven't moved, the same diagnostic playbook usually applies. Our AI Studio runs RAG audits that start at ingestion, not retrieval — because that's where the defect almost always lives.

What to do Monday morning

Before you touch your embedding model again, run this audit:

  1. Take 20 questions your users got wrong. For each one, find the source document and locate the exact passage that contains the correct answer.
  2. Pull up the chunks in your index that cover that passage. Check whether the answer and its qualifying context (scope, conditions, exceptions) are inside a single chunk.
  3. Count how many of the 20 fail this test. If more than a quarter do, you don't have a retrieval problem. You have a chunking problem, and no amount of re-ranking will fix it.

Then build per-document-type chunkers, propagate structural context, and re-ingest. It's the least glamorous fix in RAG pipeline debugging and almost always the highest-impact one.

Frequently Asked Questions

How do I know if chunking is the problem versus retrieval?

Take failed queries and manually inspect the chunks that should have been retrieved. If the correct answer plus its qualifying context doesn't exist intact in any single chunk, the problem is upstream of retrieval. If the right chunk exists but isn't being returned, then it's a retrieval or ranking issue.

What's the right chunk size for RAG?

There isn't one. Chunk size should be determined by the semantic unit of the document — a section, a table, a conversational turn, a function definition — not a fixed token count. Token limits are a guardrail, not a target. A correct chunker may produce chunks ranging from 100 to 2000 tokens within the same corpus.

Will switching to a long-context model fix my RAG accuracy?

Only partially. Longer context lets you stuff more chunks into the prompt, but you still have to retrieve the right ones, and retrieval quality is determined by chunk quality. Long context also raises latency and cost. It's a complement to good chunking, not a substitute.

Should I use semantic chunking or rule-based chunking?

For most production systems, rule-based structure-aware chunking (respecting headings, tables, lists) is more reliable and debuggable than embedding-based semantic chunking. Semantic chunkers are useful for unstructured prose like transcripts, but they introduce a second model into your ingestion pipeline that itself can fail silently.

How long does it take to re-architect a RAG pipeline's chunking layer?

It depends entirely on corpus heterogeneity, document volume, and existing eval coverage. The re-ingestion itself is usually fast; building the per-type chunkers and the evaluation harness to prove they work is the real effort. Contact CodeNicely for a personalized assessment based on your specific corpus and stack.

Building something in AI/ML?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team