LLM Latency Cheatsheet: Where 800ms Actually Goes
For: A product engineer at a Series B SaaS company who just shipped an LLM-powered feature and is getting complaints that it feels 'slow' — they can see the total response time in their APM dashboard but cannot tell whether the bottleneck is the model, the prompt, the retrieval layer, or their own infrastructure
You shipped the feature. Users say it feels slow. Your APM shows 800ms end-to-end, but that single number tells you almost nothing about which layer to fix. This is a reference for breaking down LLM latency by stage, with the interventions that actually work at each layer — and the ones that look productive but don't move user-perceived speed.
The core distinction to internalize before anything else: time-to-first-token (TTFT) and total generation time are different problems. Optimizing one rarely helps the other. Conflating them is why latency work often feels invisible to users.
The latency budget, layer by layer
A typical RAG-style request to a hosted LLM. Numbers below are realistic ranges for production systems, not benchmarks — your mileage will vary.
| Stage | Typical wall-clock time | What's happening | Counts toward |
|---|---|---|---|
| Client → your edge | 20–80ms | TLS, geo routing | TTFT |
| Auth, rate limit, request validation | 5–40ms | Middleware, DB lookups | TTFT |
| Embedding the user query | 30–150ms | Embedding model call | TTFT |
| Vector search | 10–200ms | ANN index, filters, reranking | TTFT |
| Prompt construction | 5–50ms | Templating, token counting, truncation | TTFT |
| Your server → LLM provider | 30–120ms | Network + provider queue | TTFT |
| LLM prefill (prompt processing) | 100–800ms | Model reads input tokens | TTFT |
| LLM decode (generation) | 20–80ms per ~50 tokens | Streaming output tokens | Total only |
| Post-processing, JSON repair, guardrails | 10–300ms | Validation, retries | Total only |
Note that the model itself is often a minority of TTFT. Retrieval, network, and prefill on long prompts dominate.
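To make that concrete, here is one hypothetical budget (illustrative, not measured) built from the middle of those ranges: 60ms edge + 20ms auth + 100ms embedding + 150ms vector search and reranking + 20ms prompt construction + 80ms network + 300ms prefill comes to roughly 730ms of TTFT. Prefill is the largest single line at about 40%, yet the stages you own outright, edge through prompt construction, add up to about 350ms. That is why instrumentation usually pays off before a model swap does.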
TTFT vs total latency: which matters when
| UX pattern | What users feel | Optimize |
|---|---|---|
| Streaming chat | TTFT — the moment text starts | TTFT |
| Non-streaming JSON output (function calls, structured extraction) | Total latency | Total |
| Background jobs, async pipelines | Throughput, not latency | Cost per token |
| Voice agents | The pause before the agent starts speaking | TTFT, ideally below 500ms |
| Autocomplete / inline suggest | Total — it must finish before the user moves on | Both, with a hard cap |
How to actually measure it
Add spans for each layer. Generic APM won't do this for you out of the box.
- Wrap your embedding call with its own span. Tag the model name and input token count.
- Wrap vector search separately from any reranking step. Rerankers are often the surprise cost.
- Log prompt length in tokens, not characters. Prefill scales with tokens.
- Measure TTFT explicitly: timestamp the first SSE chunk from the provider, not the end of the response.
- Capture provider-side metrics where available. OpenAI returns timing headers; Anthropic and Azure expose similar data. Compare them against your own clock to isolate the network gap.
If you only remember one thing: `total_latency - ttft` is your decode time. `ttft - sum(your_spans)` is provider-side prefill plus network. These two numbers tell you which half of the problem you have.
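A minimal sketch of that instrumentation, assuming the OpenAI Python SDK for the model call; the `span` helper, the retrieval functions, and the model name are illustrative placeholders, not a specific APM's API:

```python
import time
from contextlib import contextmanager

from openai import OpenAI  # assumption: OpenAI Python SDK; swap in your provider's client

client = OpenAI()
spans = {}

@contextmanager
def span(name):
    """Record wall-clock time for one stage under `name`."""
    start = time.perf_counter()
    yield
    spans[name] = time.perf_counter() - start

request_start = time.perf_counter()

# Your own layers: embed_query, search_index, build_prompt are placeholders
# for whatever your retrieval stack actually calls.
with span("embed"):
    query_vec = embed_query(user_query)
with span("vector_search"):
    chunks = search_index(query_vec, top_k=8)
with span("prompt_build"):
    prompt = build_prompt(user_query, chunks)

# Stream the completion and timestamp the first chunk: that is your TTFT.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None:
        ttft = time.perf_counter() - request_start
    # ... forward the chunk to the user here ...

total = time.perf_counter() - request_start
decode_time = total - ttft                 # the generation (decode) half of the budget
provider_gap = ttft - sum(spans.values())  # provider-side prefill + network + queue
```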
Interventions ranked by impact on TTFT
| Fix | Typical TTFT impact | Tradeoff |
|---|---|---|
| Stream responses (if you aren't) | Massive — perceived TTFT drops to first token | Harder to validate output mid-stream; JSON streaming is awkward |
| Shorten the prompt (fewer retrieved chunks, tighter system prompt) | Large — prefill scales with input tokens | May reduce answer quality; needs eval harness |
| Smaller / faster model for the first hop | Large | Quality drop, possibly need a router or fallback |
| Co-locate your server with the LLM provider region | 30–100ms | Vendor lock to a region |
| Cache embeddings for repeated queries | 30–150ms on hits | Cache invalidation when corpus updates |
| Replace cross-encoder reranker with a lighter one or skip it | 50–200ms | Retrieval quality may drop |
| Prompt caching (Anthropic, OpenAI, Gemini) | Large for long shared prefixes | Only helps when prefix is stable; minimums apply |
| Speculative decoding / smaller draft model | Provider-dependent | Mostly out of your hands on hosted APIs |
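Streaming is the first row in that table for a reason. If you are not doing it yet, here is a minimal sketch of the change, assuming FastAPI on your side and the OpenAI Python SDK for the model call; the route, payload shape, and model name are illustrative:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.post("/answer")
def answer(payload: dict):
    prompt = payload["prompt"]  # in a real handler this comes from your retrieval + templating

    def token_stream():
        # stream=True yields chunks as the model decodes, so the client sees
        # the first token at TTFT instead of waiting for the whole response.
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            max_tokens=500,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(token_stream(), media_type="text/plain")
```

On the client, render tokens as they arrive; the perceived wait collapses to TTFT even though total generation time is unchanged.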
Interventions for total latency (decode-bound)
- Cap `max_tokens` aggressively. Models will fill the space you give them. A 500-token cap on a summary is usually plenty.
- Ask for terse output in the system prompt. "Respond in under 80 words." This works.
- Prefer structured output over prose when downstream code is the consumer. Less text to generate.
- Parallelize independent calls. If you make two LLM calls sequentially and they don't depend on each other, you're paying decode time twice (see the sketch after this list).
- Skip the "reasoning out loud" pattern in production unless you actually need it. Chain-of-thought generation is decode time you're paying for.
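A sketch of the parallelization point, assuming the async client from the OpenAI Python SDK; the two calls (a summary and a tag extraction) are illustrative stand-ins for any pair of independent calls in your pipeline:

```python
import asyncio
from openai import AsyncOpenAI  # async client from the OpenAI Python SDK

client = AsyncOpenAI()

async def call_llm(prompt: str, max_tokens: int = 300) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,  # cap output: decode time scales with generated tokens
    )
    return resp.choices[0].message.content

async def handle(ticket_text: str):
    # The summary and the tag classification don't depend on each other,
    # so run them concurrently and pay decode time once, not twice.
    summary, tags = await asyncio.gather(
        call_llm(f"Summarize in under 80 words:\n{ticket_text}"),
        call_llm(f"Return 3 topic tags as a JSON array for:\n{ticket_text}", max_tokens=60),
    )
    return summary, tags
```

The wall-clock cost becomes the slower of the two calls instead of their sum.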
Things that look like fixes but usually aren't
- Switching cloud providers rarely moves the needle unless you're cross-region. Measure first.
- Upgrading your vector DB when search is 30ms of an 800ms budget. Fix the bigger spans first.
- Adding more retrieval (more chunks, hybrid search, multi-query) almost always increases TTFT. Make sure quality wins justify it.
- Microservice splits for the LLM layer. Each hop adds a network round trip. Latency-sensitive paths want fewer hops, not more.
A practical triage order
- Are you streaming? If not, start there.
- Plot TTFT and total latency separately. Are users complaining about the first byte or the last?
- Instrument every span. Find the top two contributors.
- For TTFT: shorten input tokens, enable prompt caching, co-locate.
- For total: cap output tokens, tighten prompts, parallelize.
- Only then consider model swaps or infra changes.
Most teams we've worked with through our AI engineering practice find that the first round of instrumentation alone reshuffles their priority list. The bottleneck is rarely where the team assumed it was — it's usually retrieval, prefill on bloated prompts, or a sequential call chain that could be parallel.
Frequently Asked Questions
Why is my LLM response slow even though the model benchmark shows fast inference?
Public benchmarks measure model inference in isolation, usually on short prompts with warm caches. Production latency includes embedding the query, vector search, prompt construction, network hops to the provider, and prefill on prompts that may be 10x longer than the benchmark. The model itself is often less than half of your wall-clock time.
What's a good time-to-first-token target for a chat UI?
Under 1 second feels responsive when streaming; under 500ms feels instant. Voice agents need sub-500ms TTFT to avoid awkward pauses. For non-streaming structured output, target total latency under 2 seconds for interactive use; beyond that, move it to an async pattern with a loading state.
Does prompt caching actually help in production?
Yes, when you have a long stable prefix — a large system prompt, a fixed set of tool definitions, or a shared document context across requests. Anthropic, OpenAI, and Gemini all expose it now. It does not help when every request has a unique prompt, and minimum prefix sizes apply, so check the provider docs before assuming a win.
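As one concrete shape this takes, here is a minimal sketch using the Anthropic Python SDK's cache_control content blocks; the model name and system prompt are placeholders, and minimum cacheable sizes and cache lifetime vary by provider, so check the current docs:

```python
from anthropic import Anthropic

client = Anthropic()

LONG_SYSTEM_PROMPT = "..."  # your large, stable instructions and shared context

def answer(user_query: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=500,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # marks the prefix up to this block as cacheable; repeat requests
                # with an identical prefix skip most of the prefill cost
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_query}],
    )
```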
Should I run my own model to reduce latency?
Sometimes. Self-hosting can eliminate provider queue time and give you control over batching and quantization, but you trade that for capacity planning, GPU costs, and ops overhead. It's usually only worth it at scale, with predictable traffic, or when data residency requires it. If you're considering it for a specific workload, talk to CodeNicely for a personalized assessment.
How do I know if my retrieval layer is the bottleneck?
Instrument the embedding call and vector search as separate spans. If their combined time is more than 150–200ms and your TTFT is high, that's your fix. Common culprits: a cross-encoder reranker on every query, no embedding cache for repeated inputs, or fetching far more chunks than the model actually needs.
Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.