How to Cut AI Inference Costs Without Touching Your Model
For: a Series A SaaS CTO whose AI feature shipped six months ago and is getting real usage, who just got a cloud bill that makes the unit economics unworkable, and whose team assumes the fix is a cheaper model or more caching rather than a routing and request-shaping problem
Your AI feature shipped six months ago. Usage is real. The cloud bill is now growing faster than revenue, and your team's first instinct is to swap to a cheaper model or bolt on more caching. Both will help a little. Neither addresses what is actually broken.
Here is the part most teams miss: AI inference overspend is rarely a model problem. It is a routing problem. You are sending trivial requests, moderate requests, and genuinely complex requests through the same expensive endpoint, paying the worst-case price for every single call. Once you stop doing that, the cost curve bends without anyone noticing a quality drop.
This playbook is for the CTO who has shipped, has data, and is now staring at a unit economics problem. It assumes you have logs, you control the call site, and you have a week or two of engineering bandwidth to spend on this. If that is you, run these steps in order.
When this playbook applies
You should run this if:
- You have at least 30 days of production inference logs with prompt, response, latency, and token counts
- Your AI feature calls a hosted LLM API (OpenAI, Anthropic, Bedrock, Vertex, Groq, etc.)
- Your bill is dominated by inference, not embeddings or fine-tuning
- You have already tried "use a smaller model" or "add caching" and the savings were disappointing
If you are pre-launch or sub-1,000 calls a day, do not bother. The complexity is not worth it yet.
Step 1: Profile your request distribution before you change anything
Before you touch routing, you need to know what you are routing. Pull 7 to 14 days of production traffic and bucket every request on three axes:
- Input complexity — input token count, presence of structured context (RAG chunks, tool outputs), number of conversation turns
- Output requirement — does this need reasoning, generation, classification, extraction, or rewriting?
- Visibility — will a human read this output, or is it a background job (summarization for search, tagging, moderation pre-filter)?
You are looking for the shape of the distribution. In almost every SaaS workload I have profiled, it looks roughly the same: 50 to 70 percent of calls are trivial (classification, extraction, short rewrites, intent detection), 20 to 35 percent are moderate (single-turn generation with light context), and 5 to 15 percent are genuinely hard (multi-turn reasoning, long-context synthesis, code generation with tool use).
You are paying GPT-4-class prices for the 70 percent that does not need it.
Anti-pattern: skipping this step because "we already know our traffic." You do not. Engineers consistently overestimate the share of complex requests by 2-3x because complex requests are what they remember debugging.
You'll know this step is done when you can show a histogram of input token counts and a pie chart of request types, and can name the top 3 highest-volume request shapes by call site.
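A minimal profiling sketch to get that far, assuming your logs can be exported as JSON lines with input_tokens and call_site fields (the file name, field names, and bucket cut-offs are placeholders; adjust them to whatever your logging actually records):
```python
import json
from collections import Counter

# Assumed log format: one JSON object per line with input_tokens and
# call_site fields; adjust the names to whatever your logging records.
buckets = Counter()
total = 0

with open("inference_logs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        total += 1
        # Rough complexity buckets by input size; tune the cut-offs
        # once you have seen your own histogram.
        if record["input_tokens"] < 500:
            size = "trivial"
        elif record["input_tokens"] < 3000:
            size = "moderate"
        else:
            size = "heavy"
        buckets[(record["call_site"], size)] += 1

for (call_site, size), count in buckets.most_common(10):
    print(f"{call_site:30s} {size:9s} {count:8d}  {100 * count / total:.1f}%")
```
Token count alone is a crude proxy for complexity, and only the first of the three axes, but a ten-minute script like this is usually enough to show that a handful of call sites dominate the volume.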
Step 2: Kill the requests that should never have been made
Before you optimize routing, eliminate waste. Walk through your top request shapes and look for these:
- Background generations no user will see. A surprising number of teams generate AI summaries, tags, or descriptions on write, then never surface them, or surface them only for the 5 percent of records a user actually opens. Move these to lazy generation on read, with a cache.
- Re-generations of identical inputs. Hash the normalized prompt and check a Redis cache before calling (a minimal sketch follows this list). This is the "more caching" advice your team already gave you, but most implementations cache too narrowly: they cache full prompts when they should cache sub-components (the system prompt expansion, the RAG retrieval, the tool schema rendering).
- Streaming when the consumer does not stream. Streaming earns its keep only when a user is actually watching tokens arrive. If your frontend waits for the full response anyway, or a backend job is "streaming" into a buffer, you are maintaining a streaming path for a UX you do not have, and those call sites become candidates for plain requests and batching.
- Retry storms. Check your retry policy. Exponential backoff with jitter, max 2 retries, and never retry on 4xx. I have seen bills inflated 15 to 20 percent by aggressive retry on rate-limit responses.
- Debug logging that calls the model. Yes, this happens. Someone added an "explain why this was classified this way" debug call and forgot to gate it behind a flag.
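A minimal sketch of the normalized-prompt cache check, assuming a Redis instance and a call_model() wrapper around your provider SDK (both are placeholders for whatever you actually have):
```python
import hashlib

import redis

r = redis.Redis()
CACHE_TTL_SECONDS = 24 * 3600  # tune to how quickly your data goes stale


def normalize(prompt: str) -> str:
    # Strip whitespace and case differences so trivially-identical prompts collide.
    return " ".join(prompt.split()).lower()


def cached_completion(prompt: str, call_model) -> str:
    """Check Redis before paying for an inference call.

    `call_model` is a placeholder for your provider wrapper.
    """
    key = "llm:" + hashlib.sha256(normalize(prompt).encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit.decode()
    response = call_model(prompt)
    r.set(key, response, ex=CACHE_TTL_SECONDS)
    return response
```
The same pattern applies one level down: cache the rendered system prompt and the retrieval results separately, keyed on their own inputs, and assemble the final prompt per request.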
Anti-pattern: trying to reduce token counts on prompts before you have eliminated wasted calls. Saving 200 tokens on a request you should not have made is theater.
You'll know this step is done when you have a list of every call site, each one has a documented reason it exists, and you have removed or deferred at least the top 2 sources of waste.
Step 3: Build a router, not a model swap
This is the architectural shift. Replace your single inference endpoint with a router that classifies the request and dispatches it to the cheapest model that can handle it. The router itself should be cheap — a small classifier, a rules engine, or a tiny model running on your own infrastructure.
A pragmatic three-tier setup:
- Tier 1 (fast lane): classification, extraction, short rewrites, intent detection, moderation. Use Haiku, Gemini Flash, GPT-4o-mini, Llama 3.1 8B on Groq, or a fine-tuned small model. Often 10-30x cheaper per token than frontier models.
- Tier 2 (standard): single-turn generation with context, summarization, structured output. Use Sonnet, GPT-4o, Gemini Pro.
- Tier 3 (heavy): multi-step reasoning, long-context synthesis, agentic tool use, code generation. Use Opus, GPT-4 Turbo, o1, Gemini 1.5 Pro with full context.
The router decides tier based on the features you profiled in Step 1: input length, request type, user tier, whether tools are required. Start with rules. Hard-coded rules will get you 80 percent of the way there. Do not build an ML classifier for routing on day one.
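A rules-first router can be very small. The sketch below is illustrative only; the tier names, thresholds, and RequestFeatures fields are assumptions to be replaced with the features you actually extracted in Step 1:
```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FAST = 1      # Haiku / Flash / GPT-4o-mini class
    STANDARD = 2  # Sonnet / GPT-4o class
    HEAVY = 3     # Opus / o1 class


@dataclass
class RequestFeatures:
    request_type: str   # "classify", "extract", "generate", "agentic", ...
    input_tokens: int
    turns: int
    needs_tools: bool


def route(features: RequestFeatures) -> Tier:
    # Hard rules first: anything tool-using or agentic goes to the heavy tier.
    if features.needs_tools or features.request_type == "agentic":
        return Tier.HEAVY
    # Very long or deeply multi-turn requests usually need the heavy tier.
    if features.input_tokens > 8000 or features.turns > 4:
        return Tier.HEAVY
    if features.request_type in {"classify", "extract", "rewrite", "intent"}:
        return Tier.FAST
    if features.input_tokens < 3000 and features.turns <= 2:
        return Tier.STANDARD
    return Tier.HEAVY
```
The point is not these particular thresholds; it is that the decision is a pure function of features you already log, so it costs microseconds, lives in your codebase, and is trivially testable.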
Anti-pattern: "let's use a model to route to models." You will pay for two inference calls instead of one and the routing model will be wrong often enough to negate the savings. Rules first. Always.
Anti-pattern: routing by user tier alone (free vs paid). Tempting, but it punishes free users with bad output on requests that genuinely need a strong model, and overspends on paid users for requests a small model handles fine. Route by request shape, then apply tier as a modifier (e.g., free users get fewer Tier 3 escalations per day).
You'll know this step is done when at least 50 percent of production traffic is being served by Tier 1, you have a fallback path from Tier 1 to Tier 2 on low-confidence outputs, and your p95 latency on Tier 1 requests is under 1 second.
Step 4: Add a confidence-based escalation path
Routing alone is not enough. You need a way to recover when the small model gets it wrong, without doing two full calls every time.
Patterns that work (a sketch of the first two follows this list):
- Self-reported confidence. Ask the small model to return a confidence score with structured output. Below a threshold, escalate to Tier 2. Calibrate the threshold from your eval set.
- Validator pass. For structured outputs (JSON extraction, classification), validate the schema and required fields. On failure, escalate.
- Consistency check. For high-stakes outputs, run the small model twice with temperature 0 and a tiny perturbation. If they disagree, escalate.
- Length and format heuristics. If the small model returns suspiciously short output, refuses, or hits a length cap, escalate.
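A sketch of the validator-plus-confidence pattern, assuming the Tier 1 call returns JSON with label and confidence fields (the schema, threshold, and call_tier1/call_tier2 helpers are placeholders for illustration):
```python
import json

CONFIDENCE_THRESHOLD = 0.7  # calibrate against your eval set, not a guess
REQUIRED_FIELDS = {"label", "confidence"}


def classify_with_escalation(prompt: str, call_tier1, call_tier2) -> dict:
    """Try the cheap model first; escalate on bad structure or low confidence.

    `call_tier1` and `call_tier2` are placeholders for your routed model calls.
    """
    raw = call_tier1(prompt)
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(call_tier2(prompt))  # schema failure: escalate

    if not REQUIRED_FIELDS.issubset(result):
        return json.loads(call_tier2(prompt))  # missing fields: escalate
    if result["confidence"] < CONFIDENCE_THRESHOLD:
        return json.loads(call_tier2(prompt))  # low confidence: escalate
    return result
```
Log every escalation together with its reason; the escalation rate and its breakdown are the numbers you tune against in the done-criteria below.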
The math: if Tier 1 handles 60 percent of traffic at 1/20th the cost per call, and you escalate 15 percent of those to Tier 2 (paying for both calls), the effective cost on that 60 percent is roughly 0.05 + 0.15 = 0.20 of the original, assuming a Tier 2 call costs about as much as what you were paying before. That is the win.
Anti-pattern: escalating on every Tier 1 call "just to be safe." You have now built a more expensive system than you started with.
You'll know this step is done when your escalation rate from Tier 1 to Tier 2 is between 10 and 20 percent, and your eval suite shows quality parity with the all-Tier-3 baseline within an acceptable margin (typically < 2 percent regression on your task-specific metric).
Step 5: Shape the request itself
Now that routing is in place, attack the requests that remain. Per-request token reduction compounds.
- Trim system prompts. Most production system prompts are 1,500-3,000 tokens of accumulated "just in case" instructions. Audit them. Remove examples that no longer apply. Move rarely-triggered instructions to conditional injection.
- Compress RAG context. If you are stuffing 8 chunks into context, you are probably wasting half. Re-rank, then take top 3-4. Use a smaller embedding model for retrieval — they are usually fine.
- Use prompt caching where the provider offers it. Anthropic and OpenAI both support prompt caching for repeated system prompts and context. The discount on cached input tokens is significant (Anthropic: 90 percent off on cache hits). This is free money if your system prompt is stable.
- Cap output tokens aggressively. Set max_tokens to the realistic ceiling for the task, not the model maximum. A classifier that returns "positive/negative/neutral" does not need max_tokens=4096 (see the sketch after this list).
- Batch where you can. If you have non-realtime workloads (nightly tagging, bulk summarization), use the batch APIs. OpenAI and Anthropic both offer ~50 percent discounts on batch.
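A hedged sketch of two of these levers together, prompt caching plus a tight output cap, using the Anthropic Python SDK as the example provider. The model alias and the exact cache_control syntax should be checked against the current docs before copying this:
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "...your stable, already-trimmed system prompt..."

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # assumed Tier 1 model; pick your own
    max_tokens=16,  # a sentiment label does not need 4096 tokens of headroom
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Marks the stable prefix as cacheable so repeat calls pay the
            # discounted cached-input rate instead of full price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Classify the sentiment: 'Love the new dashboard.'"}],
)
print(response.content[0].text)
```
The cache only pays off if the cached prefix is byte-stable across calls, which is another reason to keep per-request data out of the system prompt and inject it into the user message instead.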
You'll know this step is done when your average input tokens per call has dropped at least 20 percent and your prompt cache hit rate is above 60 percent for Tier 2 and Tier 3 traffic.
Step 6: Instrument cost per feature, not per call
The mistake most teams make after a cost optimization sprint is celebrating the bill drop and moving on. Six months later, the bill is back where it started because a new feature shipped without cost discipline.
Build a dashboard that shows cost per feature, cost per user cohort, and cost per request type. Tag every inference call with these dimensions at the call site. When a PM proposes a new AI feature, you can now answer "what will this cost per active user per month" before you ship.
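A minimal sketch of that call-site tagging, assuming you emit one structured event per inference call to whatever metrics or warehouse pipeline you already run (the price table, field names, and emit callback are placeholders):
```python
import time
from dataclasses import asdict, dataclass

# Placeholder per-million-token (input, output) prices; keep these in config
# and review them quarterly, since providers reprice often.
PRICES = {"tier1": (0.25, 1.25), "tier2": (3.00, 15.00), "tier3": (15.00, 75.00)}


@dataclass
class InferenceEvent:
    feature: str        # "email_draft", "ticket_tagging", ...
    user_cohort: str    # "free", "paid", "enterprise"
    request_type: str
    tier: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    timestamp: float


def record_inference(feature, user_cohort, request_type, tier,
                     input_tokens, output_tokens, emit):
    """`emit` is a placeholder for your metrics or warehouse client."""
    in_price, out_price = PRICES[tier]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    emit(asdict(InferenceEvent(feature, user_cohort, request_type, tier,
                               input_tokens, output_tokens, cost, time.time())))
```
With every call tagged this way, "cost per feature per paying customer" becomes a one-line aggregation in your warehouse rather than a spreadsheet exercise.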
Set alerts on cost per request by tier. If your Tier 1 average cost per request creeps up, someone has either added tokens to the prompt or the router is misclassifying.
You'll know this step is done when you can answer "what does the AI in this feature cost us per paying customer per month" in under 5 minutes, without a spreadsheet.
Step 7: Only now consider model swaps and fine-tuning
If you have done everything above and the unit economics are still off, then talk about model swaps and fine-tuning. By this point you have:
- A clean routing layer that makes swapping any single tier safe
- An eval suite that tells you immediately if a swap regresses quality
- A dataset of real production traffic, bucketed by tier, that is the perfect input for fine-tuning a small open-source model on Tier 1 traffic
Fine-tuning a Llama 3.1 8B or a Qwen 2.5 7B on your Tier 1 traffic, hosted on Together, Fireworks, or your own GPUs, often beats hosted small models on cost and latency for a specific task. But this is only worth doing once you know which task. Most teams skip to this step first and end up fine-tuning a model for a workload that has not been properly scoped.
You'll know this step is done when you have either decided fine-tuning is worth it (with a real dataset and an eval delta target) or decided it is not, and documented why.
Failure modes I have seen
The router becomes a bottleneck. Someone implements the classifier as a synchronous call to a hosted model and adds 400ms to every request. Keep the router local, fast, and cheap. A few hundred lines of rules, or a small model on your own infra.
Quality regression hidden in the average. Aggregate eval metrics look fine but a specific high-value cohort (enterprise users, a particular language, a specific feature) silently degrades. Always slice your eval set by cohort.
Caching the wrong thing. Teams cache full prompts including the user's unique input and get a 2 percent hit rate. Cache the parts that repeat (system prompt rendering, retrieved chunks, schema descriptions) and assemble per request.
Tier sprawl. You start with three tiers and end up with seven, each with its own model, its own prompt, and its own eval. Resist. Three tiers cover 95 percent of the value. Adding a fourth doubles the maintenance cost for marginal savings.
Forgetting that providers change pricing. Re-evaluate quarterly. The price of Haiku, Flash, and GPT-4o-mini class models has dropped repeatedly. A routing decision made nine months ago may no longer be optimal.
Optimizing the wrong half of the bill. Always check: is your spend dominated by input tokens or output tokens? If it is input-heavy (long context, RAG), prompt caching and context compression dominate. If it is output-heavy (long generations), max_tokens caps and cheaper output models dominate. The fixes are different.
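A quick way to answer that question from the same logs you profiled in Step 1, assuming per-call token counts in the log and rough per-million-token prices (both placeholders to substitute with your own):
```python
import json

# Assumed (input, output) prices per million tokens; use your provider's real rates.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00

input_spend = output_spend = 0.0
with open("inference_logs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        input_spend += record["input_tokens"] * INPUT_PRICE / 1_000_000
        output_spend += record["output_tokens"] * OUTPUT_PRICE / 1_000_000

total = input_spend + output_spend
print(f"input: {100 * input_spend / total:.0f}%  output: {100 * output_spend / total:.0f}%")
```
Run the split per tier as well as in aggregate; a single long-context feature can make the whole bill look input-heavy when the rest of the workload is not.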
How CodeNicely can help
This is the kind of work we do most often with Series A and growth-stage SaaS teams whose AI features are out of the prototype phase and into the "real bill, real users, real consequences" phase. The closest analog in our case studies is GimBooks, the YC-backed accounting SaaS we have worked with on production systems that serve a high volume of varied requests at predictable unit economics — exactly the routing-and-shaping problem this playbook describes. The lessons there about treating different request shapes differently, instead of flattening everything into one expensive path, transfer directly.
If your team is sitting on the inference bill problem and does not have the bandwidth to run this playbook internally, our AI Studio team does this kind of profiling, routing layer build-out, and eval harness work as a focused engagement. Talk to us for a personalized assessment of where your spend is going and what is realistically recoverable.
Frequently Asked Questions
Will routing to smaller models hurt my output quality?
Not if you build the eval suite first and an escalation path second. The point of confidence-based escalation is that you catch the cases where the small model is wrong and recover them on a larger model, while still capturing the savings on the majority of requests where the small model is fine. Quality regression happens when teams swap models without measurement, not when they route with measurement.
How is this different from just adding a cache?
Caching only helps when the same input arrives repeatedly. In most SaaS AI workloads, exact-prompt cache hit rates are under 10 percent because user inputs are unique. Routing helps on every request, not just repeats, by matching request difficulty to model cost. The two are complementary — do both, but routing has a much higher ceiling on savings.
Should we self-host an open-source model instead?
Maybe, eventually, for specific tiers. Self-hosting Llama or Qwen on your own GPUs makes sense once you have stable Tier 1 traffic high enough to keep a GPU saturated, and a fine-tuning dataset to beat the hosted small models on your specific task. Below that threshold, hosted small models (Haiku, Flash, GPT-4o-mini, Groq's Llama endpoints) are cheaper and lower-operational-burden than self-hosting.
How much can we realistically save with this approach?
It depends entirely on your current request distribution and how much of your traffic is currently over-served by an oversized model. Teams that have never done routing typically find significant headroom; teams that have already done basic optimization find less. Contact CodeNicely for a personalized assessment of your specific workload.
How long does it take to implement a routing layer?
The scope varies a lot depending on the number of call sites, the maturity of your eval setup, and how much waste exists in Step 2. We would rather scope this against your actual codebase than quote a generic number — reach out for a personalized assessment.
Building something in SaaS?
CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.
Talk to our team.