
LLM Latency Cheatsheet: Where 800ms Actually Goes

For: A product engineer at a Series B SaaS company who just shipped an LLM-powered feature and is getting complaints that it feels 'slow' — they can see the total response time in their APM dashboard but cannot tell whether the bottleneck is the model, the prompt, the retrieval layer, or their own infrastructure

You shipped the feature. Users say it feels slow. Your APM shows 800ms end-to-end, but that single number tells you almost nothing about which layer to fix. This is a reference for breaking down LLM latency by stage, with the interventions that actually work at each layer — and the ones that look productive but don't move user-perceived speed.

The core distinction to internalize before anything else: time-to-first-token (TTFT) and total generation time are different problems. Optimizing one rarely helps the other. Conflating them is why latency work often feels invisible to users.

The latency budget, layer by layer

A typical RAG-style request to a hosted LLM. Numbers below are realistic ranges for production systems, not benchmarks — your mileage will vary.

| Stage | Typical time | What's happening | Counts toward |
|---|---|---|---|
| Client → your edge | 20–80ms | TLS, geo routing | TTFT |
| Auth, rate limit, request validation | 5–40ms | Middleware, DB lookups | TTFT |
| Embedding the user query | 30–150ms | Embedding model call | TTFT |
| Vector search | 10–200ms | ANN index, filters, reranking | TTFT |
| Prompt construction | 5–50ms | Templating, token counting, truncation | TTFT |
| Your server → LLM provider | 30–120ms | Network + provider queue | TTFT |
| LLM prefill (prompt processing) | 100–800ms | Model reads input tokens | TTFT |
| LLM decode (generation) | 20–80ms per ~50 tokens | Streaming output tokens | Total only |
| Post-processing, JSON repair, guardrails | 10–300ms | Validation, retries | Total only |

Note that the model is often not the main culprit: retrieval, network hops, and prefill on bloated prompts (which you control through prompt length) dominate TTFT.
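
To make the budget concrete, here is a hypothetical 800ms TTFT assembled from mid-range values in the table. The numbers are illustrative, not measurements from any real system.

```python
# Hypothetical stage timings in ms for one request, picked from the middle
# of the ranges above. Illustrative only, not real measurements.
stages_ms = {
    "client_to_edge": 50,
    "auth_and_validation": 20,
    "query_embedding": 80,
    "vector_search": 90,
    "prompt_construction": 20,
    "server_to_provider": 60,
    "provider_prefill": 480,
}

ttft_ms = sum(stages_ms.values())  # 800 ms until the first token appears
outside_model_ms = ttft_ms - stages_ms["provider_prefill"]
print(f"TTFT {ttft_ms} ms, of which {outside_model_ms} ms is spent before the provider starts prefill")
```

In this made-up breakdown, prefill is the single biggest line item, but the 320ms in front of it is entirely yours to optimize.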

TTFT vs total latency: which matters when

| UX pattern | What users feel | Optimize |
|---|---|---|
| Streaming chat | TTFT — the moment text starts | TTFT |
| Non-streaming JSON output (function calls, structured extraction) | Total latency | Total |
| Background jobs, async pipelines | Throughput, not latency | Cost per token |
| Voice agents | TTFT, aggressively | TTFT below 500ms ideally |
| Autocomplete / inline suggest | Total — must finish before user moves on | Both, with a hard cap |

How to actually measure it

Add spans for each layer. Generic APM won't do this for you out of the box.

If you only remember one thing: total_latency - ttft is your decode time. ttft - sum(your_spans) is provider-side prefill plus network. These two numbers tell you which half of the problem you have.
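
Here is a minimal sketch of that instrumentation in Python. The helpers embed_query, search_index, build_prompt, and stream_completion are placeholders for your own retrieval and LLM client code, not a real library's API; the span bookkeeping and the two derived numbers are the point.

```python
import time
from contextlib import contextmanager

@contextmanager
def span(spans: dict, name: str):
    """Record wall-clock time for one stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans[name] = (time.perf_counter() - start) * 1000

def handle_request(query: str) -> str:
    spans: dict[str, float] = {}
    request_start = time.perf_counter()

    with span(spans, "embed"):
        vector = embed_query(query)           # placeholder: your embedding call
    with span(spans, "search"):
        chunks = search_index(vector)         # placeholder: your vector search
    with span(spans, "prompt_build"):
        prompt = build_prompt(query, chunks)  # placeholder: your templating

    first_token_at = None
    parts = []
    for token in stream_completion(prompt):   # placeholder: your streaming LLM call
        if first_token_at is None:
            first_token_at = time.perf_counter()
        parts.append(token)

    total_ms = (time.perf_counter() - request_start) * 1000
    ttft_ms = (first_token_at - request_start) * 1000 if first_token_at else total_ms
    decode_ms = total_ms - ttft_ms                    # your decode time
    provider_side_ms = ttft_ms - sum(spans.values())  # provider prefill + network

    print(f"ttft={ttft_ms:.0f}ms decode={decode_ms:.0f}ms "
          f"provider+network={provider_side_ms:.0f}ms spans={spans}")
    return "".join(parts)
```

Ship those numbers to whatever you already use for metrics; the per-stage spans are what turns "it feels slow" into a specific fix.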

Interventions ranked by impact on TTFT

| Fix | Typical TTFT impact | Tradeoff |
|---|---|---|
| Stream responses (if you aren't) | Massive — perceived TTFT drops to first token | Harder to validate output mid-stream; JSON streaming is awkward |
| Shorten the prompt (fewer retrieved chunks, tighter system prompt) | Large — prefill scales with input tokens | May reduce answer quality; needs an eval harness |
| Smaller / faster model for the first hop | Large | Quality drop; may need a router or fallback |
| Co-locate your server with the LLM provider region | 30–100ms | Vendor lock to a region |
| Cache embeddings for repeated queries (sketch below) | 30–150ms on hits | Cache invalidation when the corpus updates |
| Replace the cross-encoder reranker with a lighter one, or skip it | 50–200ms | Retrieval quality may drop |
| Prompt caching (Anthropic, OpenAI, Gemini) | Large for long shared prefixes | Only helps when the prefix is stable; minimum prefix sizes apply |
| Speculative decoding / smaller draft model | Provider-dependent | Mostly out of your hands on hosted APIs |
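
For the embedding-cache row, a minimal sketch: normalize the query, then memoize the embedding call in-process. The embed_query helper is a placeholder for whatever embedding API you call; a shared cache like Redis follows the same pattern with a hash of the normalized query as the key.

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_cached(normalized_query: str) -> tuple[float, ...]:
    # Only runs on a miss, paying the full 30-150ms embedding call once per unique query.
    return tuple(embed_query(normalized_query))   # placeholder: your embedding call

def embed_with_cache(query: str) -> tuple[float, ...]:
    # Normalize so trivial differences (case, stray whitespace) still hit the cache.
    normalized = " ".join(query.lower().split())
    return embed_cached(normalized)
```

One caveat worth a comment in your own code: the cache key must change when you change embedding models, or you'll serve vectors from the wrong space.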

Interventions for total latency (decode-bound)

Decode time scales roughly linearly with the number of output tokens, so the levers are different from the TTFT list: cap max output tokens, ask for terser output or a tighter schema, run independent model calls in parallel instead of chaining them, and keep post-processing retries in check. Streaming does not shorten total latency; it only changes when users start seeing output.
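
The parallelization point deserves a sketch, since it's the cheapest win and the easiest to miss. Assuming an async client and two calls that don't depend on each other (summarize and extract_entities here are hypothetical coroutines wrapping your own LLM calls):

```python
import asyncio

async def enrich(document: str) -> dict:
    # The calls are independent, so run them concurrently:
    # total latency becomes max(a, b) instead of a + b.
    summary, entities = await asyncio.gather(
        summarize(document),          # hypothetical async LLM call
        extract_entities(document),   # hypothetical async LLM call
    )
    return {"summary": summary, "entities": entities}
```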

Things that look like fixes but usually aren't

The recurring ones: swapping models or providers before instrumenting anything (the bottleneck is rarely where the team assumed it was), treating streaming as a total-latency fix (it only moves the first token, the last one arrives just as late), and tuning layers that the spans show were never in the hot path. Measure first, then pick the lever.

A practical triage order

  1. Are you streaming? If not, start there.
  2. Plot TTFT and total latency separately. Are users complaining about the first byte or the last?
  3. Instrument every span. Find the top two contributors.
  4. For TTFT: shorten input tokens, enable prompt caching, co-locate.
  5. For total: cap output tokens, tighten prompts, parallelize.
  6. Only then consider model swaps or infra changes.

Most teams we've worked with through our AI engineering practice find that the first round of instrumentation alone reshuffles their priority list. The bottleneck is rarely where the team assumed it was — it's usually retrieval, prefill on bloated prompts, or a sequential call chain that could be parallel.

Frequently Asked Questions

Why is my LLM response slow even though the model benchmark shows fast inference?

Public benchmarks measure model inference in isolation, usually on short prompts with warm caches. Production latency includes embedding the query, vector search, prompt construction, network hops to the provider, and prefill on prompts that may be 10x longer than the benchmark. The model itself is often less than half of your wall-clock time.

What's a good time-to-first-token target for a chat UI?

Under 1 second feels responsive when streaming; under 500ms feels instant. Voice agents need sub-500ms TTFT to avoid awkward pauses. For non-streaming structured output, target total latency under 2 seconds for interactive use; beyond that, move it to an async pattern with a loading state.

Does prompt caching actually help in production?

Yes, when you have a long stable prefix — a large system prompt, a fixed set of tool definitions, or a shared document context across requests. Anthropic, OpenAI, and Gemini all expose it now. It does not help when every request has a unique prompt, and minimum prefix sizes apply, so check the provider docs before assuming a win.
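
What opting in looks like varies by provider. As one sketch, assuming the Anthropic Messages API's cache_control field (OpenAI applies caching automatically to long stable prefixes, and Gemini uses explicit cached content, so treat this shape as provider-specific and check the current docs):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",   # substitute whichever model you actually run
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,       # the big reusable prefix
            "cache_control": {"type": "ephemeral"},  # mark the prefix cacheable
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
```

LONG_STABLE_SYSTEM_PROMPT and user_query are stand-ins for your own prompt pieces; the win only shows up when the cached prefix is long enough to clear the provider's minimum and stable across requests.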

Should I run my own model to reduce latency?

Sometimes. Self-hosting can eliminate provider queue time and give you control over batching and quantization, but you trade that for capacity planning, GPU costs, and ops overhead. It's usually only worth it at scale, with predictable traffic, or when data residency requires it. If you're considering it for a specific workload, talk to CodeNicely for a personalized assessment.

How do I know if my retrieval layer is the bottleneck?

Instrument the embedding call and vector search as separate spans. If their combined time is more than 150–200ms and your TTFT is high, that's your fix. Common culprits: a cross-encoder reranker on every query, no embedding cache for repeated inputs, or fetching far more chunks than the model actually needs.
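
A cheap way to surface this automatically, reusing the per-stage spans from the measurement section (the 200ms and 1s thresholds below are rules of thumb from this article, not universal constants):

```python
import logging

logger = logging.getLogger("latency")

def flag_retrieval_bottleneck(spans_ms: dict[str, float], ttft_ms: float) -> None:
    retrieval_ms = spans_ms.get("embed", 0.0) + spans_ms.get("search", 0.0)
    # High TTFT plus heavy retrieval means fix the retrieval layer before
    # touching the model: reranker, embedding cache, number of chunks fetched.
    if retrieval_ms > 200 and ttft_ms > 1000:
        logger.warning(
            "retrieval took %.0f ms of a %.0f ms TTFT; check the reranker, "
            "embedding cache, and how many chunks you actually fetch",
            retrieval_ms,
            ttft_ms,
        )
```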

Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.