AI Evaluation Metrics Cheatsheet: Pick the Right One
For: A product-focused CTO at a seed-to-Series-A B2B SaaS company who just shipped their first LLM-powered feature and is being asked by the board whether it is 'working' — and realizes they have no defensible answer
Your board asks if the AI feature is working. You point at an 87% accuracy score. They nod. Six weeks later your largest customer escalates because the assistant confidently invented a refund policy. The metric never moved. It was never going to.
Evaluation metric selection is a product contract decision, not a statistics decision. The metric you pick encodes which failure your users can least afford — a missed case, a wrong answer, a slow answer, or a refusal. Most teams discover they chose wrong after the fact. This is a reference for choosing on purpose.
Step 1: Decide what kind of AI feature you actually shipped
The right metric depends on the task shape, not the model. Find your row:
| Feature type | Example | Primary metrics |
|---|---|---|
| Classification | Spam detection, ticket routing, intent detection | Precision, Recall, F1, AUC-ROC |
| Extraction | Pulling fields from invoices, resumes, contracts | Field-level F1, exact-match accuracy |
| Retrieval (RAG) | Doc search, knowledge assistant | Recall@k, MRR, nDCG, context precision |
| Generation (open-ended) | Email drafting, summarization, chat | Faithfulness, groundedness, LLM-as-judge, human eval |
| Generation (constrained) | SQL, JSON, code | Execution accuracy, schema validity, unit-test pass rate |
| Agentic / multi-step | Workflow automation, tool-using agents | Task success rate, step accuracy, cost per task |
Step 2: Precision, recall, F1 — when to use which
The classic confusion. The shortcut:
- Precision: of the things the model flagged, how many were right. Use when false positives are expensive.
- Recall: of the things that should have been flagged, how many were. Use when false negatives are expensive.
- F1: the harmonic mean of precision and recall, which is dragged down by whichever of the two is lower. Use when the costs are roughly symmetric, or you don't yet know which is worse.
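As a minimal sketch of the three definitions above, the snippet below computes them from raw binary labels. The function name and the toy labels are illustrative assumptions, not data from a real system; in practice a library such as scikit-learn would usually do this for you.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = flagged)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged, should not be
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: 4 real positives out of 10 items.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

On this toy data accuracy comes out at 0.70 while recall is only 0.50: the model misses half the cases that matter, and the headline number hides it.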
| Scenario | Optimize for | Why |
|---|---|---|
| Fraud / KYC flagging | Recall (with a precision floor) | Missing a bad actor costs more than reviewing a good one |
| Auto-responding to support tickets | Precision | A confident wrong answer damages trust |
| Lead scoring for sales | Precision @ top-k | Sales only works the top of the list |
| Medical safety check (e.g. drug interactions) | Recall, near 100% | One miss is catastrophic; false alarms are tolerable |
| Spam filter | Precision (high), recall secondary | Users forgive spam in inbox; they don't forgive losing real mail |
For an example of where recall has to be near-perfect because misses are clinically dangerous, see how interaction checking is structured in our HealthPotli e-pharmacy build.
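"Recall with a precision floor" can be turned into a concrete threshold choice. The sketch below assumes the classifier emits a score per item and that you have labels for a held-out set; `pick_threshold`, the scores, and the floor value are all hypothetical and only illustrate the idea.

```python
def pick_threshold(scores, labels, precision_floor):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least precision_floor, or None."""
    best = None
    for threshold in sorted(set(scores)):
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        # Keep the candidate only if it respects the floor and improves recall.
        if precision >= precision_floor and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Hypothetical held-out scores (higher = more likely to be a bad actor).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

lenient = pick_threshold(scores, labels, precision_floor=0.6)
strict = pick_threshold(scores, labels, precision_floor=0.9)
print("floor 0.6:", lenient)
print("floor 0.9:", strict)
```

Raising the floor from 0.6 to 0.9 on this data drops recall from 1.0 to 0.5. That trade is the product decision; the threshold is only where it gets written down.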
Step 3: Metrics for LLM-generated text
BLEU was built for machine translation and ROUGE for summarization, both scored against fixed reference texts. They are weak proxies for whether a generated answer is correct, grounded, or useful. Stop reporting them to your board.
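To see why overlap metrics miss factual errors, here is a simplified unigram-overlap F1. It is a stand-in in the spirit of ROUGE-1, not the official ROUGE or BLEU implementation, and the refund-policy sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """Simplified word-overlap F1 between two strings (whitespace tokens)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared words, respecting counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "refunds are available within 30 days of purchase"
wrong_fact = "refunds are available within 90 days of purchase"
good_paraphrase = "you can get your money back for one month after buying"

score_wrong = unigram_f1(reference, wrong_fact)
score_paraphrase = unigram_f1(reference, good_paraphrase)
print(f"wrong fact: {score_wrong:.3f}, correct paraphrase: {score_paraphrase:.3f}")
```

The answer with the wrong refund window scores 0.875 and the correct paraphrase scores 0.0, the opposite of what a customer would want rewarded.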
| Metric | What it measures | Good for | Bad at |
|---|---|---|---|
| BLEU / ROUGE | n-gram overlap with reference | Translation, tight summarization | Open-ended generation, factuality |
| BERTScore | Semantic similarity to reference | Paraphrase quality | Hallucinations that sound right |
| Faithfulness / Groundedness | Does output stay within retrieved context | RAG, summarization | Tasks with no source doc |
| Answer relevance | Does output address the question | QA, chat | Detecting subtle wrongness |
| LLM-as-judge | A stronger model scores outputs on a rubric | Scaling human eval | Bias toward verbose answers, model family bias |
| Human eval | SMEs rate on a rubric | Ground truth | Cost, throughput, inter-rater drift |
Practical setup: run human eval on a frozen golden set of 100–300 examples, run LLM-as-judge in CI on every pull request, and track faithfulness and answer relevance on production traffic.
Step 4: RAG-specific metrics
RAG fails in two places. Diagnose separately.
- Retrieval quality: Recall@k (is the right doc in the top k?), MRR (how high up?), context precision (how much of the retrieved context is actually relevant?).
- Generation quality given retrieval: Faithfulness (did the answer stick to the docs?), answer relevance (did it answer the question?).
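The retrieval-side metrics above are simple enough to compute by hand. This sketch assumes each evaluation query has a ranked list of retrieved document ids and a hand-labeled set of relevant ids; the ids and function names are hypothetical. Context precision here follows the plain definition given above (share of retrieved items that are relevant); some evaluation libraries use a rank-weighted variant instead.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def context_precision(retrieved, relevant, k):
    """Fraction of the top k retrieved items that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


# Hypothetical labeled queries: ranked retrieval output plus relevant ids.
queries = [
    {"retrieved": ["d3", "d1", "d7"], "relevant": {"d1"}},
    {"retrieved": ["d2", "d9", "d4"], "relevant": {"d2", "d4"}},
    {"retrieved": ["d5", "d6", "d8"], "relevant": {"d0"}},
]
k = 3
n = len(queries)
mean_recall = sum(recall_at_k(q["retrieved"], q["relevant"], k) for q in queries) / n
mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in queries) / n
mean_ctx_precision = sum(
    context_precision(q["retrieved"], q["relevant"], k) for q in queries
) / n
print(f"recall@{k}={mean_recall:.2f} mrr={mrr:.2f} "
      f"context_precision={mean_ctx_precision:.2f}")
```

Scoring retrieval on its own like this, before any generation happens, is what lets you tell a retrieval failure from a generation failure.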
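If faithfulness is high but users still complain, suspect retrieval first: the answer is staying faithful to the wrong documents. If retrieval recall is high but answers are wrong, the generator is hallucinating, or the prompt is implicitly giving it permission to go beyond the retrieved context.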
Step 5: The metrics nobody tracks until production breaks them
| Metric | Why it matters |
|---|---|
| p50 / p95 latency | A correct answer after 12 seconds is a churned user |
| Cost per successful task | Accuracy at any price is not a product |
| Refusal / abstention rate | Models that refuse too often look broken; too rarely, unsafe |
| Tool-call success rate | For agents, the model can be right and the agent still fail |
| User-side signals | Thumbs up/down, copy rate, edit-distance on drafted text, retry rate |
| Drift | Distribution shift in inputs over time — your metric can hold while reality moves |
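Several rows in this table fall out of the same per-call log. The sketch below assumes a log of (latency, cost, succeeded, refused) records; the record layout and the numbers are made up, and the percentile uses the simple nearest-rank method rather than interpolation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical call log: (latency_s, cost_usd, succeeded, refused)
calls = [
    (0.8, 0.01, True, False),
    (0.9, 0.01, True, False),
    (1.0, 0.01, True, False),
    (1.1, 0.01, True, False),
    (1.2, 0.01, True, False),
    (1.3, 0.01, True, False),
    (1.5, 0.01, True, False),
    (2.0, 0.01, True, False),
    (3.0, 0.01, False, True),
    (12.0, 0.01, False, False),
]

latencies = [c[0] for c in calls]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)

successes = sum(1 for c in calls if c[2])
total_cost = sum(c[1] for c in calls)
# Failed calls still cost money, so divide total spend by successes only.
cost_per_success = total_cost / successes if successes else float("inf")
refusal_rate = sum(1 for c in calls if c[3]) / len(calls)

print(f"p50={p50}s p95={p95}s cost/success=${cost_per_success:.4f} "
      f"refusal_rate={refusal_rate:.0%}")
```

The median here looks healthy at 1.2 seconds while p95 is 12 seconds, which is why the table asks for both.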
Step 6: Tie the metric to a business outcome
The trap: optimizing a metric that doesn't move the number the feature was funded to move. Build a two-column doc before you instrument anything.
| Business outcome | Eval metric that should track it |
|---|---|
| Support deflection rate | Answer relevance + faithfulness + thumbs-up rate |
| Sales cycle compression | Precision@10 of lead scoring + rep acceptance rate |
| Underwriting approval throughput | Recall on risky applications + manual-review rate |
| Time-to-first-draft | p95 latency + edit-distance on accepted drafts |
If you cannot draw a line from the eval metric to a number a revenue leader cares about, you are measuring model behavior, not product value. For domain-heavy features — credit decisions, logistics routing, financial reconciliation — the metric has to encode the cost of each error type. See how this plays out in credit scoring at Cashpo or route optimization at Vahak.
Step 7: A minimum viable eval stack
- Golden set: 100–300 hand-labeled examples. Freeze it. Version it. Add edge cases as they come in from prod.
- Offline CI: run the golden set on every prompt or model change. Block merges on regression of your primary metric.
- Online metrics: log inputs, outputs, latency, cost, and a user signal (thumbs, accept/reject, edit distance) on every call.
- Periodic human review: sample 50 production traces per week. Score on a rubric. Watch for drift.
- One north-star metric per feature, tied to a business outcome. Everything else is a guardrail.
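The "block merges on regression" step can be as small as a comparison between two metric dictionaries. This is a sketch under assumptions: the metric names, the tolerance, and the guardrail limits are hypothetical, and how the scores get produced (golden-set run, judge model) is left out.

```python
def regression_gate(baseline, candidate, primary, tolerance=0.01, guardrails=None):
    """Return a list of failure messages; an empty list means safe to merge.

    primary    -- name of the north-star metric (higher is better)
    tolerance  -- allowed drop in the primary metric before failing
    guardrails -- dict of metric name -> maximum allowed value
    """
    failures = []
    if candidate[primary] < baseline[primary] - tolerance:
        failures.append(
            f"{primary} regressed: {baseline[primary]:.3f} -> {candidate[primary]:.3f}"
        )
    for name, limit in (guardrails or {}).items():
        if candidate[name] > limit:
            failures.append(f"{name} above limit: {candidate[name]} > {limit}")
    return failures


# Hypothetical golden-set results for the main branch and a pull request.
baseline = {"faithfulness": 0.92, "p95_latency_s": 3.1}
candidate = {"faithfulness": 0.88, "p95_latency_s": 6.4}

failures = regression_gate(
    baseline,
    candidate,
    primary="faithfulness",
    guardrails={"p95_latency_s": 5.0},
)
for message in failures:
    print("FAIL:", message)
```

In a CI job you would exit with a non-zero status when `failures` is non-empty. The small tolerance is there so that run-to-run noise on a few hundred examples does not block every merge.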
What the cheatsheet won't fix
Metrics don't tell you what your users actually wanted. They tell you whether the thing you built does the thing you said it would. If the spec was wrong, every metric will look fine while NPS drops. Pair evals with qualitative review of real conversations, every week, forever.
Frequently Asked Questions
Is accuracy ever the right metric for an LLM feature?
Rarely on its own. Accuracy is only a meaningful summary when classes are roughly balanced and error costs are symmetric, which is almost never true in production. For classification with imbalanced data, use F1 or precision/recall at a fixed threshold. For generation, accuracy isn't well-defined; use faithfulness, answer relevance, or task success instead.
How do I evaluate an AI feature when I don't have labeled data yet?
Start with LLM-as-judge on a rubric tied to your product contract (e.g. "is this answer grounded in the provided document? yes/no"), and sample 50–100 outputs for human review weekly. Use those reviews to build a golden set incrementally. You don't need thousands of labels to catch regressions — you need a frozen, representative few hundred.
When should I trust LLM-as-judge versus human evaluation?
LLM-as-judge is good for scaling consistency checks and CI, but it has known biases — preference for longer answers, bias toward outputs from the same model family, and weakness on domain-specific correctness. Treat it as a regression detector, not ground truth. Calibrate it against human eval on your golden set quarterly.
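One way to run that calibration is to score the same golden-set items with both humans and the judge, then look at raw agreement and at Cohen's kappa, which discounts agreement expected by chance. The labels below are invented (1 = "grounded", 0 = "not grounded"), and kappa is offered as one reasonable choice of statistic, not the only one.

```python
def cohens_kappa(human, judge):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(human)
    observed = sum(1 for h, j in zip(human, judge) if h == j) / n
    labels = set(human) | set(judge)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on 10 golden-set answers.
human = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1]

agreement = sum(1 for h, j in zip(human, judge) if h == j) / len(human)
kappa = cohens_kappa(human, judge)
print(f"raw agreement={agreement:.2f} kappa={kappa:.2f}")
```

Here raw agreement is 0.80 but kappa is about 0.55, and both disagreements are cases where the judge passed an answer the human failed. A lenient judge like that is exactly what a quarterly calibration is meant to catch.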
How often should evaluation metrics be reviewed?
Offline metrics: every prompt or model change, automatically. Online metrics: daily dashboards, weekly review. The metric definition itself: every quarter, or whenever a customer escalation reveals a failure mode your metric didn't catch. Metrics rot — user behavior, model versions, and input distributions all shift.
What does it cost to build a proper evaluation pipeline?
That depends on the feature, the risk profile, and how much human labeling you need. For a scoped assessment of an eval stack for your specific LLM feature, talk to CodeNicely for a personalized review.