LLM Latency Cheatsheet: Where 800ms Actually Goes
For: A product engineer at a Series B SaaS company who just shipped an LLM-powered feature and is getting complaints that it feels 'slow' — they can see the total response time in their APM dashboard but cannot tell whether the bottleneck is the model, the prompt, the retrieval layer, or their own infrastructure
You shipped the feature. Users say it feels slow. Your APM shows 800ms end-to-end, but that single number tells you almost nothing about which layer to fix. This is a reference for breaking down LLM latency by stage, with the interventions that actually work at each layer — and the ones that look productive but don't move user-perceived speed.
The core distinction to internalize before anything else: time-to-first-token (TTFT) and total generation time are different problems. Optimizing one rarely helps the other. Conflating them is why latency work often feels invisible to users.
The latency budget, layer by layer
A typical RAG-style request to a hosted LLM. Numbers below are realistic ranges for production systems, not benchmarks — your mileage will vary.
| Stage | Typical wall-clock time | What's happening | Counts toward |
|---|---|---|---|
| Client → your edge | 20–80ms | TLS, geo routing | TTFT |
| Auth, rate limit, request validation | 5–40ms | Middleware, DB lookups | TTFT |
| Embedding the user query | 30–150ms | Embedding model call | TTFT |
| Vector search | 10–200ms | ANN index, filters, reranking | TTFT |
| Prompt construction | 5–50ms | Templating, token counting, truncation | TTFT |
| Your server → LLM provider | 30–120ms | Network + provider queue | TTFT |
| LLM prefill (prompt processing) | 100–800ms | Model reads input tokens | TTFT |
| LLM decode (generation) | 20–80ms per ~50 tokens | Streaming output tokens | Total only |
| Post-processing, JSON repair, guardrails | 10–300ms | Validation, retries | Total only |
Note that the model itself is often a minority of TTFT. Retrieval, network, and prefill on long prompts dominate.
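To make that concrete, here is one hypothetical budget (illustrative, not measured) built from the middle of those ranges: 60ms edge + 20ms auth + 100ms embedding + 150ms vector search and reranking + 20ms prompt construction + 80ms network + 300ms prefill comes to roughly 730ms of TTFT. Prefill is the largest single line at about 40%, yet the stages you own outright, edge through prompt construction, add up to about 350ms. That is why instrumentation usually pays off before a model swap does.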
TTFT vs total latency: which matters when
| UX pattern | What users feel | Optimize |
|---|---|---|
| Streaming chat | TTFT — the moment text starts | TTFT |
| Non-streaming JSON output (function calls, structured extraction) | Total latency | Total |
| Background jobs, async pipelines | Throughput, not latency | Cost per token |
| Voice agents | The pause before the agent starts speaking | TTFT, ideally below 500ms |
| Autocomplete / inline suggest | Total — it must finish before the user moves on | Both, with a hard cap |
How to actually measure it
Add spans for each layer. Generic APM won't do this for you out of the box.
- Wrap your embedding call with its own span. Tag the model name and input token count.
- Wrap vector search separately from any reranking step. Rerankers are often the surprise cost.
- Log prompt length in tokens, not characters. Prefill scales with tokens.
- Measure TTFT explicitly: timestamp the first SSE chunk from the provider, not the end of the response.
- Capture provider-side metrics where available. OpenAI returns timing headers; Anthropic and Azure expose similar data. Compare them against your own clock to isolate the network gap.
If you only remember one thing: `total_latency - ttft` is your decode time. `ttft - sum(your_spans)` is provider-side prefill plus network. These two numbers tell you which half of the problem you have.
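A minimal sketch of that instrumentation, assuming the OpenAI Python SDK for the model call; the `span` helper, the retrieval functions, and the model name are illustrative placeholders, not a specific APM's API:

```python
import time
from contextlib import contextmanager

from openai import OpenAI  # assumption: OpenAI Python SDK; swap in your provider's client

client = OpenAI()
spans = {}

@contextmanager
def span(name):
    """Record wall-clock time for one stage under `name`."""
    start = time.perf_counter()
    yield
    spans[name] = time.perf_counter() - start

request_start = time.perf_counter()

# Your own layers: embed_query, search_index, build_prompt are placeholders
# for whatever your retrieval stack actually calls.
with span("embed"):
    query_vec = embed_query(user_query)
with span("vector_search"):
    chunks = search_index(query_vec, top_k=8)
with span("prompt_build"):
    prompt = build_prompt(user_query, chunks)

# Stream the completion and timestamp the first chunk: that is your TTFT.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None:
        ttft = time.perf_counter() - request_start
    # ... forward the chunk to the user here ...

total = time.perf_counter() - request_start
decode_time = total - ttft                 # the generation (decode) half of the budget
provider_gap = ttft - sum(spans.values())  # provider-side prefill + network + queue
```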
Interventions ranked by impact on TTFT
| Fix | Typical TTFT impact | Tradeoff |
|---|---|---|
| Stream responses (if you aren't) | Massive — perceived TTFT drops to first token | Harder to validate output mid-stream; JSON streaming is awkward |
| Shorten the prompt (fewer retrieved chunks, tighter system prompt) | Large — prefill scales with input tokens | May reduce answer quality; needs eval harness |
| Smaller / faster model for the first hop | Large | Quality drop, possibly need a router or fallback |
| Co-locate your server with the LLM provider region | 30–100ms | Vendor lock to a region |
| Cache embeddings for repeated queries | 30–150ms on hits | Cache invalidation when corpus updates |
| Replace cross-encoder reranker with a lighter one or skip it | 50–200ms | Retrieval quality may drop |
| Prompt caching (Anthropic, OpenAI, Gemini) | Large for long shared prefixes | Only helps when prefix is stable; minimums apply |
| Speculative decoding / smaller draft model | Provider-dependent | Mostly out of your hands on hosted APIs |
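Streaming is the first row in that table for a reason. If you are not doing it yet, here is a minimal sketch of the change, assuming FastAPI on your side and the OpenAI Python SDK for the model call; the route, payload shape, and model name are illustrative:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.post("/answer")
def answer(payload: dict):
    prompt = payload["prompt"]  # in a real handler this comes from your retrieval + templating

    def token_stream():
        # stream=True yields chunks as the model decodes, so the client sees
        # the first token at TTFT instead of waiting for the whole response.
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            max_tokens=500,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(token_stream(), media_type="text/plain")
```

On the client, render tokens as they arrive; the perceived wait collapses to TTFT even though total generation time is unchanged.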
Interventions for total latency (decode-bound)
- Cap `max_tokens` aggressively. Models will fill the space you give them. A 500-token cap on a summary is usually plenty.
- Ask for terse output in the system prompt. "Respond in under 80 words." This works.
- Prefer structured output over prose when downstream code is the consumer. Less text to generate.
- Parallelize independent calls. If you make two LLM calls sequentially and they don't depend on each other, you're paying decode time twice (see the sketch after this list).
- Skip the "reasoning out loud" pattern in production unless you actually need it. Chain-of-thought generation is decode time you're paying for.
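A sketch of the parallelization point, assuming the async client from the OpenAI Python SDK; the two calls (a summary and a tag extraction) are illustrative stand-ins for any pair of independent calls in your pipeline:

```python
import asyncio
from openai import AsyncOpenAI  # async client from the OpenAI Python SDK

client = AsyncOpenAI()

async def call_llm(prompt: str, max_tokens: int = 300) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,  # cap output: decode time scales with generated tokens
    )
    return resp.choices[0].message.content

async def handle(ticket_text: str):
    # The summary and the tag classification don't depend on each other,
    # so run them concurrently and pay decode time once, not twice.
    summary, tags = await asyncio.gather(
        call_llm(f"Summarize in under 80 words:\n{ticket_text}"),
        call_llm(f"Return 3 topic tags as a JSON array for:\n{ticket_text}", max_tokens=60),
    )
    return summary, tags
```

The wall-clock cost becomes the slower of the two calls instead of their sum.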
Things that look like fixes but usually aren't
- Switching cloud providers rarely moves the needle unless you're cross-region. Measure first.
- Upgrading your vector DB when search is 30ms of an 800ms budget. Fix the bigger spans first.
- Adding more retrieval (more chunks, hybrid search, multi-query) almost always increases TTFT. Make sure quality wins justify it.
- Microservice splits for the LLM layer. Each hop adds a network round trip. Latency-sensitive paths want fewer hops, not more.
A practical triage order
- Are you streaming? If not, start there.
- Plot TTFT and total latency separately. Are users complaining about the first byte or the last?
- Instrument every span. Find the top two contributors.
- For TTFT: shorten input tokens, enable prompt caching, co-locate.
- For total: cap output tokens, tighten prompts, parallelize.
- Only then consider model swaps or infra changes.
Most teams we've worked with through our AI engineering practice find that the first round of instrumentation alone reshuffles their priority list. The bottleneck is rarely where the team assumed it was — it's usually retrieval, prefill on bloated prompts, or a sequential call chain that could be parallel.
Frequently Asked Questions
Why is my LLM response slow even though the model benchmark shows fast inference?
Public benchmarks measure model inference in isolation, usually on short prompts with warm caches. Production latency includes embedding the query, vector search, prompt construction, network hops to the provider, and prefill on prompts that may be 10x longer than the benchmark. The model itself is often less than half of your wall-clock time.
What's a good time-to-first-token target for a chat UI?
Under 1 second feels responsive when streaming; under 500ms feels instant. Voice agents need sub-500ms TTFT to avoid awkward pauses. For non-streaming structured output, target total latency under 2 seconds for interactive use; beyond that, move it to an async pattern with a loading state.
Does prompt caching actually help in production?
Yes, when you have a long stable prefix — a large system prompt, a fixed set of tool definitions, or a shared document context across requests. Anthropic, OpenAI, and Gemini all expose it now. It does not help when every request has a unique prompt, and minimum prefix sizes apply, so check the provider docs before assuming a win.
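As one concrete shape this takes, here is a minimal sketch using the Anthropic Python SDK's cache_control content blocks; the model name and system prompt are placeholders, and minimum cacheable sizes and cache lifetime vary by provider, so check the current docs:

```python
from anthropic import Anthropic

client = Anthropic()

LONG_SYSTEM_PROMPT = "..."  # your large, stable instructions and shared context

def answer(user_query: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=500,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # marks the prefix up to this block as cacheable; repeat requests
                # with an identical prefix skip most of the prefill cost
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_query}],
    )
```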
Should I run my own model to reduce latency?
Sometimes. Self-hosting can eliminate provider queue time and give you control over batching and quantization, but you trade that for capacity planning, GPU costs, and ops overhead. It's usually only worth it at scale, with predictable traffic, or when data residency requires it. If you're considering it for a specific workload, talk to CodeNicely for a personalized assessment.
How do I know if my retrieval layer is the bottleneck?
Instrument the embedding call and vector search as separate spans. If their combined time is more than 150–200ms and your TTFT is high, that's your fix. Common culprits: a cross-encoder reranker on every query, no embedding cache for repeated inputs, or fetching far more chunks than the model actually needs.
Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.