
LangChain vs. LlamaIndex vs. Raw API: Pick One

For: A seed-to-Series-A CTO who is three days into prototyping an LLM feature and has just discovered that their LangChain abstraction is hiding a bug they cannot locate — and is now questioning whether the framework is helping or hurting

If you're reading this, you probably have a stack trace open in another tab. Some chain inside LangChain is swallowing an exception, or a retriever is returning empty results, and the abstraction that was supposed to save you a week is now costing you a weekend. You are asking the right question: keep the framework, switch frameworks, or rip it out and call the API directly?

Here is the honest answer most comparison posts won't give you: the token overhead and learning curve are not the real cost of these frameworks. The real cost is that their abstractions leak under edge cases in ways that force you to understand the underlying API anyway. You end up paying the abstraction tax without escaping the raw-API complexity. Teams that ship fastest don't pick the framework with the best quickstart — they pick based on where in the stack they want to own the complexity.

This post compares the three real options — LangChain, LlamaIndex, and a thin raw-API wrapper — on the dimensions that actually matter once you're past the demo: debugging cost, abstraction leakage, lock-in, and production fit.

The three options, briefly

LangChain is an orchestration framework. It models your application as chains and agents — sequences of LLM calls, tool invocations, memory, and routing logic. Its sweet spot is multi-step workflows where the sequence of steps depends on intermediate results. LangGraph (its newer state-machine layer) is the part most teams should actually be evaluating today, not the legacy chain APIs.

LlamaIndex is a retrieval and indexing framework. It models your application as data → index → query. Its sweet spot is RAG: ingesting documents, chunking them sensibly, building vector or hybrid indexes, and serving grounded answers. It overlaps with LangChain (it has agents now too), but its center of gravity is retrieval quality.
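That data → index → query shape is compact in code. A minimal sketch, assuming the llama-index package and a configured embedding/LLM provider; the directory name and query are illustrative:

```python
# Minimal LlamaIndex sketch of the data -> index -> query shape described above.
# Directory and query are illustrative; assumes an embedding/LLM provider is
# configured via environment variables.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("docs/").load_data()    # ingest
index = VectorStoreIndex.from_documents(documents)        # chunk + embed + index
query_engine = index.as_query_engine(similarity_top_k=4)  # retrieve + synthesize

response = query_engine.query("What does our refund policy say about partial orders?")
print(response)
```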

Raw API means a thin wrapper around the OpenAI / Anthropic / Bedrock SDK. You write the prompt orchestration, the retry logic, the retrieval, and the tool-calling glue yourself. You probably also write a small Pydantic layer for structured outputs and a logging hook for observability.
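Concretely, "thin wrapper" can be as small as one function per feature. A sketch, assuming the OpenAI Python SDK's structured-output parse helper; the model name and schema are illustrative:

```python
# Sketch of a "thin wrapper": one function, one provider SDK, a Pydantic model
# for the structured output. Model name and schema are illustrative.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class TicketTriage(BaseModel):
    category: str
    urgency: str
    summary: str

def triage_ticket(ticket_text: str) -> TicketTriage:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Triage the support ticket."},
            {"role": "user", "content": ticket_text},
        ],
        response_format=TicketTriage,
    )
    return completion.choices[0].message.parsed
```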

Head-to-head: the dimensions that matter

| Dimension | LangChain (LangGraph) | LlamaIndex | Raw API wrapper |
|---|---|---|---|
| Best for | Multi-step agents, tool use, stateful workflows | RAG over your own documents, hybrid retrieval | Single-purpose features, well-defined I/O |
| Time to first demo | Hours | Hours | A day or two |
| Time to production-grade | Slower than it looks — abstractions leak | Faster than LangChain for pure RAG | Slower upfront, predictable later |
| Debugging cost | High. Stack traces go through layers of wrappers | Medium. Cleaner than LangChain, still indirect | Low. The bug is in code you wrote |
| Abstraction leakage | Significant — you'll read framework source | Moderate — mostly around chunking and retrievers | None by definition |
| Lock-in | High. Migrating off chains is rewriting | Medium. Indexes are portable, query layer isn't | Low. SDK calls map 1:1 across providers |
| Observability | LangSmith is genuinely good (paid) | Decent built-in tracing | You wire your own (Langfuse, Helicone, OTel) |
| Streaming, tool calls, structured output | Supported but lags provider features | Supported, less central | Always current — you're calling the SDK |
| Team size that benefits | 3+ engineers, multiple workflows | 1-3 engineers shipping RAG | 1-2 engineers, focused scope |

Where each option actually breaks

LangChain breaks at the abstraction boundary

The classic failure mode: you build a chain in an afternoon, it works on five test inputs, you ship it, and then a user sends an input that triggers a tool call that returns malformed JSON, and the error surfaces three layers deep inside AgentExecutor with a message that doesn't tell you which prompt produced it. You open the framework source, trace through the wrappers, find that a default handle_parsing_errors=True is swallowing the real exception, and now you're a LangChain contributor whether you wanted to be or not.

This isn't a knock on LangChain; it's the price of generality. LangChain has to support dozens of LLM providers, vector stores, and tool formats. The cost of that flexibility is indirection. LangGraph improves the situation significantly by making state explicit, but you still pay an abstraction tax on every primitive.
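For a sense of what "explicit state" buys you, here is a minimal LangGraph sketch; the state fields and single node are illustrative, not a production graph:

```python
# Minimal LangGraph sketch: state is an explicit TypedDict, nodes are plain
# functions that return partial state updates. Field names are illustrative.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    answer: str

def call_model(state: State) -> dict:
    # Call your LLM of choice here; stubbed out for the sketch.
    return {"answer": f"stub answer to: {state['question']}"}

graph = StateGraph(State)
graph.add_node("call_model", call_model)
graph.set_entry_point("call_model")
graph.add_edge("call_model", END)

app = graph.compile()
print(app.invoke({"question": "What changed in the last release?"}))
```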

LangChain is right when: you have multiple agents, tool-using workflows, or stateful conversations, and you have at least one engineer who is willing to read the framework source when things go wrong. LangGraph + LangSmith is a real combo for production agent systems.

LangChain is wrong when: you have a single, well-defined LLM feature (summarize this, classify that, extract these fields). You will spend more time fighting the framework than you would have spent writing the 200 lines it replaces.

LlamaIndex breaks at the retrieval edge cases

LlamaIndex gets you to a working RAG demo faster than anything else. The defaults — sentence-window chunking, top-k vector retrieval, a basic re-ranker — are sensible. The problem starts when retrieval quality plateaus and you need to debug why the right chunk isn't being returned for a specific query.

At that point you'll discover that the abstraction has opinions about chunk metadata, node relationships, and query transformations that you now need to learn in detail. The good news: LlamaIndex's abstractions are thinner than LangChain's, and the retrieval layer is genuinely well-designed. The bad news: hybrid retrieval, query rewriting, and re-ranking tuning still require you to understand what's happening under the hood.
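In practice, graduating from the defaults usually starts with taking over chunking and widening retrieval so you can see what's actually being returned. A hedged sketch; the chunk sizes and top-k values are illustrative, not recommendations:

```python
# Sketch of graduating from the defaults: explicit chunking plus a wider
# retriever you can inspect directly. Parameter values are illustrative.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("docs/").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=64)],
)

retriever = index.as_retriever(similarity_top_k=8)
nodes = retriever.retrieve("how is revenue recognized for annual contracts")
for node in nodes:
    # Inspect score, source file, and the first 80 characters of each chunk.
    print(node.score, node.node.metadata.get("file_name"), node.node.text[:80])
```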

LlamaIndex is right when: RAG is the core of the feature, you have non-trivial document structure (PDFs with tables, hierarchical docs, code), and you want strong defaults you can graduate from.

LlamaIndex is wrong when: your retrieval is simple (a few hundred docs, flat structure) — pgvector + a 50-line query function will outperform it on debuggability. It's also wrong as a general agent framework; that's not what it's optimized for.
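For that simple-corpus case, the pgvector version really is short. A sketch, assuming a documents table with a populated pgvector embedding column; the table, column, and model names are illustrative:

```python
# Sketch of the pgvector alternative: one embedding call, one SQL query.
# Table/column names and the embedding model are illustrative assumptions.
import psycopg
from openai import OpenAI

client = OpenAI()

def retrieve(query: str, conn: psycopg.Connection, k: int = 5) -> list[str]:
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    # pgvector's <=> operator orders by cosine distance to the query vector.
    literal = "[" + ",".join(map(str, vec)) + "]"
    rows = conn.execute(
        "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
        (literal, k),
    ).fetchall()
    return [row[0] for row in rows]
```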

Raw API breaks at the boring stuff

Calling openai.chat.completions.create directly is the most underrated option for early-stage features. You get the latest provider features the day they ship — structured outputs, prompt caching, vision, the new tool-calling formats. There is no version mismatch between what the framework supports and what the API offers.

What you give up: retries with exponential backoff, token counting, prompt templates, conversation memory, RAG plumbing, tracing — every one of which is a weekend of work, and none of which is intellectually interesting. If you build five LLM features this way, you are slowly building your own internal LangChain, and yours will be worse than the real one.
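To make "a weekend of work" concrete, here is roughly what one of those pieces, the retry wrapper, looks like; the retry counts and waits are illustrative:

```python
# Sketch of one piece of that plumbing: retries with exponential backoff and
# jitter around a single chat call. Attempt count and waits are illustrative.
import random
import time

from openai import APIError, OpenAI, RateLimitError

client = OpenAI()

def chat_with_retry(messages: list[dict], model: str = "gpt-4o-mini", attempts: int = 5) -> str:
    for attempt in range(attempts):
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except (RateLimitError, APIError):
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s.
            time.sleep(2 ** attempt + random.random())
```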

Raw API is right when: the feature is one or two LLM calls with clear inputs and outputs (classification, extraction, summarization, single-turn chat). It's also right when you've already chosen your provider and don't expect to switch.

Raw API is wrong when: you'll have more than ~3 distinct LLM features, multi-provider support is a real requirement, or you need agent-style tool use. At that point you're rebuilding LangGraph from scratch.

The decision framework

Skip the feature matrix. Ask three questions in order:

  1. Is this primarily RAG? If retrieval over your own documents is the core thing the feature does, start with LlamaIndex. Drop down to raw pgvector + your own query code only if the corpus is small and flat.
  2. Does this feature involve more than one LLM call with branching, tools, or state? If yes, LangGraph (not legacy LangChain chains). If no, skip to the next question.
  3. Single, well-defined LLM call with structured input/output? Raw API. Don't import a framework to wrap one function.

The mistake almost every team makes: they pick the framework first, then design the feature to fit. Reverse it. Sketch the feature on a whiteboard — inputs, calls, retrievals, outputs — and the right tool becomes obvious.

The lock-in question nobody asks

If you build on LangChain and the project's API churns again (it has, twice), or if a better framework appears in 18 months, what does your migration look like? For LangChain agents, the answer is "rewrite." The chain abstractions don't map cleanly onto anything else.

For LlamaIndex, your indexes are portable (they're just vectors and metadata), but your query pipeline isn't. For raw API code, migration means swapping the SDK import and adjusting a few request shapes.
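What "adjusting a few request shapes" means in practice: the same single-turn call against two providers, using each SDK directly. Model names are illustrative:

```python
# The same call against two providers. The request shapes differ slightly
# (system prompt placement, required max_tokens), but the mapping is mechanical.
from openai import OpenAI
from anthropic import Anthropic

prompt = "Summarize this changelog in three bullets: ..."

openai_reply = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "system", "content": "You are terse."},
              {"role": "user", "content": prompt}],
).choices[0].message.content

anthropic_reply = Anthropic().messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system="You are terse.",
    messages=[{"role": "user", "content": prompt}],
).content[0].text
```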

This isn't an argument against frameworks — it's an argument for being honest about what you're buying. The teams that get burned are the ones who treat "we use LangChain" as a permanent architectural decision instead of a phase.

How CodeNicely can help

Most of the LLM features we ship for clients end up as a mix: raw API for the simple stuff, LlamaIndex or pgvector for retrieval, and LangGraph only when there's genuine multi-step orchestration. We arrived at that pattern the hard way, by debugging the same leaky abstractions you're wrestling with right now.

The closest reference for this kind of decision is our work on HealthPotli, where we built an AI drug interaction layer for an e-pharmacy. The naive approach would have been a LangChain agent with a medical knowledge tool. What actually shipped was a tightly scoped retrieval pipeline with a thin raw-API call on top — because the feature had clear inputs (a cart of medications), a clear retrieval need (interaction data), and a clear output (flagged interactions with explanations). A heavier framework would have added debugging surface area without adding capability.

If you're at the stage where the framework choice is starting to feel load-bearing, our AI Studio team does architecture reviews that focus specifically on this — what to keep thin, what to abstract, and what to rip out before it calcifies. We work with seed and Series A teams often enough that we've seen most of the failure modes.

Frequently Asked Questions

Should I use LangChain or call the OpenAI API directly?

If your feature is one or two LLM calls with well-defined inputs and outputs, call the API directly. If you have multi-step workflows, tool use, or stateful agents, LangGraph (LangChain's newer state-machine layer) earns its weight. The middle case — three or four sequential calls with simple branching — is where teams over-adopt LangChain. A 200-line orchestrator is usually clearer.

When should I use LangChain vs LlamaIndex?

LangChain (specifically LangGraph) is for orchestration: agents, tool use, multi-step workflows. LlamaIndex is for retrieval: ingesting documents, building indexes, serving grounded answers. They overlap, but use each for its center of gravity. If your feature is mostly RAG, LlamaIndex will get you to good retrieval faster. If it's mostly agent behavior, LangGraph will give you better state management.

Is LangChain production-ready?

LangGraph and LangSmith are. Legacy LangChain chain APIs are usable but have churned enough that we'd avoid building new systems on them. The honest production picture: teams running LangChain at scale are running LangGraph with LangSmith for observability, not the older LLMChain / AgentExecutor patterns.

What's the real cost of LLM orchestration frameworks?

It's not tokens or latency — those are small. The real cost is debugging time when abstractions leak, and migration cost if you outgrow the framework. Both are hard to estimate upfront, which is why most teams underweight them when picking a framework.

Can CodeNicely help us choose and implement the right LLM stack?

Yes. We do architecture reviews and full builds across LLM orchestration, RAG pipelines, and agent systems. For scoping and a personalized assessment of your specific pipeline, contact CodeNicely with a short description of the feature and what you've already tried.


Pick the option that matches where you want to own the complexity. If you want to own the orchestration logic, write it. If you want to own the retrieval quality, use LlamaIndex. If you want to own the agent state machine, use LangGraph. The framework that hides the part you actually need to control is the wrong framework — no matter how many GitHub stars it has.

Building something in AI/ML?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team