Startups SaaS June 21, 2026 • 7 min read

AI Evaluation Metrics Cheatsheet: Pick the Right One

For: A product-focused CTO at a seed-to-Series-A B2B SaaS company who just shipped their first LLM-powered feature and is being asked by the board whether it is 'working' — and realizes they have no defensible answer

Your board asks if the AI feature is working. You point at an 87% accuracy score. They nod. Six weeks later your largest customer escalates because the assistant confidently invented a refund policy. The metric never moved. It was never going to.

Evaluation metric selection is a product contract decision, not a statistics decision. The metric you pick encodes which failure your users can least afford — a missed case, a wrong answer, a slow answer, or a refusal. Most teams discover they chose wrong after the fact. This is a reference for choosing on purpose.

Step 1: Decide what kind of AI feature you actually shipped

The right metric depends on the task shape, not the model. Find your row:

Feature type	Example	Primary metrics
Classification	Spam detection, ticket routing, intent detection	Precision, Recall, F1, AUC-ROC
Extraction	Pulling fields from invoices, resumes, contracts	Field-level F1, exact-match accuracy
Retrieval (RAG)	Doc search, knowledge assistant	Recall@k, MRR, nDCG, context precision
Generation (open-ended)	Email drafting, summarization, chat	Faithfulness, groundedness, LLM-as-judge, human eval
Generation (constrained)	SQL, JSON, code	Execution accuracy, schema validity, unit-test pass rate
Agentic / multi-step	Workflow automation, tool-using agents	Task success rate, step accuracy, cost per task
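For the constrained-generation row, schema validity is usually the cheapest metric to automate. The following is a minimal sketch, assuming the feature is supposed to emit a JSON object with a known set of required keys. The function name, the sample outputs, and the `sku` / `qty` keys are all hypothetical.

```python
import json


def schema_validity_rate(outputs, required_keys):
    """Fraction of outputs that parse as a JSON object containing every required key."""
    if not outputs:
        return 0.0
    valid = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not parseable at all
        # Must be an object (dict), not a list or scalar, and carry every required key.
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            valid += 1
    return valid / len(outputs)


# Hypothetical model outputs: one valid, one missing a key, one not JSON, one wrong shape.
outputs = ['{"sku": "A1", "qty": 2}', '{"sku": "B7"}', "not json", "[1, 2]"]
rate = schema_validity_rate(outputs, {"sku", "qty"})
print(f"schema validity: {rate:.2f}")
```

This only checks shape. Whether the values are right still needs execution accuracy or unit tests.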

Step 2: Precision, recall, F1 — when to use which

The classic confusion. The shortcut:

Precision: of the things the model flagged, how many were right. Use when false positives are expensive.
Recall: of the things that should have been flagged, how many were. Use when false negatives are expensive.
F1: harmonic mean of precision and recall. Use when the costs are roughly symmetric, or you don't yet know which is worse.
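The three definitions above can be sketched in a few lines of plain Python. This assumes binary labels where 1 means "should be flagged"; the sample labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = flagged)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean: a low value on either side pulls F1 down.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data: 4 true positives exist; the model flags 3 items, 2 of them correctly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here precision is 2/3 and recall is 1/2, so F1 lands at 4/7, closer to the weaker of the two.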

Scenario	Optimize for	Why
Fraud / KYC flagging	Recall (with a precision floor)	Missing a bad actor costs more than reviewing a good one
Auto-responding to support tickets	Precision	A confident wrong answer damages trust
Lead scoring for sales	Precision@k	Sales only works the top of the list
Medical safety check (e.g. drug interactions)	Recall, near 100%	One miss is catastrophic; false alarms are tolerable
Spam filter	Precision (high), recall secondary	Users forgive spam in inbox; they don't forgive losing real mail
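"Recall with a precision floor" can be turned into a threshold choice. Below is one possible sketch, assuming the model emits a score per item and you have labeled examples to sweep over. The function name, scores, labels, and the 0.75 floor are hypothetical.

```python
def pick_threshold(scores, labels, precision_floor):
    """Return (threshold, precision, recall) with the highest recall
    among thresholds whose precision meets the floor, or None if none do."""
    total_pos = sum(labels)
    best = None
    for threshold in sorted(set(scores)):
        # Everything scoring at or above the threshold gets flagged.
        flagged = [label for score, label in zip(scores, labels) if score >= threshold]
        tp = sum(flagged)
        precision = tp / len(flagged)  # flagged is never empty: threshold is an observed score
        recall = tp / total_pos if total_pos else 0.0
        if precision >= precision_floor and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Illustrative scores and ground-truth labels (1 = bad actor).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
threshold, precision, recall = pick_threshold(scores, labels, precision_floor=0.75)
print(threshold, precision, recall)
```

For a precision-first scenario, the same sweep works with the roles swapped: maximize precision subject to a recall floor.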

For an example of where recall has to be near-perfect because misses are clinically dangerous, see how interaction checking is structured in our HealthPotli e-pharmacy build.

Step 3: Metrics for LLM-generated text

BLEU was built for machine translation and ROUGE for summarization, and both score n-gram overlap against fixed references. They are weak proxies for whether a generated answer is correct, grounded, or useful. Stop reporting them to your board.

Metric	What it measures	Good for	Bad at
BLEU / ROUGE	n-gram overlap with reference	Translation, tight summarization	Open-ended generation, factuality
BERTScore	Semantic similarity to reference	Paraphrase quality	Hallucinations that sound right
Faithfulness / Groundedness	Does output stay within retrieved context	RAG, summarization	Tasks with no source doc
Answer relevance	Does output address the question	QA, chat	Detecting subtle wrongness
LLM-as-judge	A stronger model scores outputs on a rubric	Scaling human eval	Bias toward verbose answers, model family bias
Human eval	SMEs rate on a rubric	Ground truth	Cost, throughput, inter-rater drift

Practical setup: human eval on a frozen 100–300 example golden set, LLM-as-judge for CI on every pull request, faithfulness and answer-relevance tracked in production traffic.

Step 4: RAG-specific metrics

RAG fails in two places. Diagnose separately.

Retrieval quality: Recall@k (is the right doc in the top k?), MRR (how high up?), context precision (how much of the retrieved context is actually relevant?).
Generation quality given retrieval: Faithfulness (did the answer stick to the docs?), answer relevance (did it answer the question?).
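The retrieval-side metrics are simple enough to compute without a framework. This is a sketch under assumptions: each query has a ranked list of retrieved doc ids and a hand-labeled set of relevant ids, and Recall@k is taken as the fraction of relevant docs found in the top k (with a single relevant doc per query it reduces to the hit-or-miss reading above). The doc ids and queries are invented.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Illustrative eval set: (ranked retrieved ids, labeled relevant ids) per query.
queries = [
    (["d3", "d1", "d9"], {"d1"}),        # relevant doc at rank 2
    (["d4", "d5", "d6"], {"d6", "d8"}),  # one of two relevant docs, at rank 3
    (["d2", "d7", "d0"], {"d5"}),        # complete miss
]
mean_recall_at_3 = sum(recall_at_k(r, rel, 3) for r, rel in queries) / len(queries)
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(f"recall@3={mean_recall_at_3:.3f} mrr={mrr:.3f}")
```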

If faithfulness is high but users still complain, suspect retrieval first: the answer is sticking to documents that were the wrong ones to fetch. If retrieval recall is high but answers are wrong, your generator is hallucinating or your prompt is giving it permission to.

Step 5: The metrics nobody tracks until production breaks them

Metric	Why it matters
p50 / p95 latency	A correct answer after 12 seconds is a churned user
Cost per successful task	Accuracy at any price is not a product
Refusal / abstention rate	Models that refuse too often look broken; too rarely, unsafe
Tool-call success rate	For agents, the model can be right and the agent still fail
User-side signals	Thumbs up/down, copy rate, edit-distance on drafted text, retry rate
Drift	Distribution shift in inputs over time. Your golden-set score can hold steady while production inputs move away from it
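Two of the rows above, latency percentiles and cost per successful task, fall straight out of call logs. The sketch below assumes each logged call records a latency, a cost, and a success flag; the field names and numbers are hypothetical, and the percentile uses the nearest-rank definition (other definitions interpolate and give slightly different values).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_successful_task(calls):
    """Total spend divided by the number of calls that succeeded."""
    successes = sum(1 for call in calls if call["success"])
    total_cost = sum(call["cost_usd"] for call in calls)
    return total_cost / successes if successes else float("inf")


# Illustrative log: ten calls at a flat (made-up) cost, two of which failed.
latencies = [0.8, 1.1, 0.9, 1.4, 12.0, 1.0, 1.2, 0.7, 1.3, 2.5]
outcomes = [True, True, True, False, False, True, True, True, True, True]
calls = [
    {"latency_s": lat, "cost_usd": 0.004, "success": ok}
    for lat, ok in zip(latencies, outcomes)
]

p50 = percentile([c["latency_s"] for c in calls], 50)
p95 = percentile([c["latency_s"] for c in calls], 95)
cost = cost_per_successful_task(calls)
print(f"p50={p50}s p95={p95}s cost/success=${cost:.4f}")
```

Note how the single 12-second call is invisible at p50 and dominates p95, which is why both are worth tracking.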

Step 6: Tie the metric to a business outcome

The trap: optimizing a metric that doesn't move the number the feature was funded to move. Build a two-column doc before you instrument anything.

Business outcome	Eval metric that should track it
Support deflection rate	Answer relevance + faithfulness + thumbs-up rate
Sales cycle compression	Precision@10 of lead scoring + rep acceptance rate
Underwriting approval throughput	Recall on risky applications + manual-review rate
Time-to-first-draft	p95 latency + edit-distance on accepted drafts

If you cannot draw a line from the eval metric to a number a revenue leader cares about, you are measuring model behavior, not product value. For domain-heavy features — credit decisions, logistics routing, financial reconciliation — the metric has to encode the cost of each error type. See how this plays out in credit scoring at Cashpo or route optimization at Vahak.
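One way to make a metric "encode the cost of each error type" is to weight false positives and false negatives by what they cost the business. This is a sketch only: the dollar figures are invented placeholders, and real costs would come from your finance or risk team.

```python
def expected_cost_per_decision(y_true, y_pred, cost_fp, cost_fn):
    """Average cost per decision, charging each error type its own price."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return (fp * cost_fp + fn * cost_fn) / len(y_true)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Two hypothetical models scored on the same ten applications (1 = risky).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pred_a = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # one miss, one false alarm
pred_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # no misses, two false alarms

# Placeholder costs: a manual review is cheap, an approved bad loan is not.
COST_FP, COST_FN = 5.0, 200.0
cost_a = expected_cost_per_decision(y_true, pred_a, COST_FP, COST_FN)
cost_b = expected_cost_per_decision(y_true, pred_b, COST_FP, COST_FN)
print(accuracy(y_true, pred_a), accuracy(y_true, pred_b))  # identical accuracy
print(cost_a, cost_b)  # very different cost
```

Both models score the same accuracy, yet under these assumed costs one is roughly twenty times more expensive per decision.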

Step 7: A minimum viable eval stack

Golden set: 100–300 hand-labeled examples. Version it, and freeze each version so scores stay comparable across runs. Add edge cases from prod as a new version.
Offline CI: run the golden set on every prompt or model change. Block merges on regression of your primary metric.
Online metrics: log inputs, outputs, latency, cost, and a user signal (thumbs, accept/reject, edit distance) on every call.
Periodic human review: sample 50 production traces per week. Score on a rubric. Watch for drift.
One north-star metric per feature, tied to a business outcome. Everything else is a guardrail.
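The offline CI step in the list above can start as a small gate script. This is a minimal sketch, assuming you already produce a dict of metric scores for the baseline and for the candidate change. The metric names, scores, and the 0.01 tolerance are hypothetical.

```python
def check_regression(baseline, candidate, primary_metric, tolerance=0.01):
    """Compare the primary metric between two runs; return (passed, message)."""
    before = baseline[primary_metric]
    after = candidate[primary_metric]
    # Allow small drops so run-to-run noise does not block every merge.
    passed = (before - after) <= tolerance
    verdict = "ok" if passed else "REGRESSION"
    return passed, f"{verdict}: {primary_metric} {before:.3f} -> {after:.3f}"


# Illustrative golden-set scores before and after a prompt change.
baseline = {"faithfulness": 0.91, "answer_relevance": 0.88}
candidate = {"faithfulness": 0.86, "answer_relevance": 0.90}

passed, message = check_regression(baseline, candidate, "faithfulness")
print(message)
# In a CI job, a non-zero exit code is what actually blocks the merge, e.g.:
# sys.exit(0 if passed else 1)
```

Only the primary metric gates the merge here. Guardrail metrics such as answer relevance can be reported in the same run without blocking.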

What the cheatsheet won't fix

Metrics don't tell you what your users actually wanted. They tell you whether the thing you built does the thing you said it would. If the spec was wrong, every metric will look fine while NPS drops. Pair evals with qualitative review of real conversations, every week, forever.

Frequently Asked Questions

Is accuracy ever the right metric for an LLM feature?

Rarely on its own. Accuracy is only informative when classes are roughly balanced and error costs are symmetric, which is almost never true in production. For classification with imbalanced data use F1 or precision/recall at a fixed threshold. For generation, accuracy isn't well-defined; use faithfulness, answer relevance, or task success instead.
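A quick worked example shows why accuracy misleads on imbalanced data. The numbers are invented: 5 positives in 100 examples, and a "model" that never flags anything.

```python
# 5% of examples are positive (say, fraudulent); the rest are negative.
y_true = [1] * 5 + [0] * 95
# A degenerate model that predicts "negative" for everything.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t and p)
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}")  # 0.95
print(f"recall={recall:.2f}")      # 0.00
```

A 95% accuracy score and a model that catches nothing are the same model here.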

How do I evaluate an AI feature when I don't have labeled data yet?

Start with LLM-as-judge on a rubric tied to your product contract (e.g. "is this answer grounded in the provided document? yes/no"), and sample 50–100 outputs for human review weekly. Use those reviews to build a golden set incrementally. You don't need thousands of labels to catch regressions — you need a frozen, representative few hundred.

When should I trust LLM-as-judge versus human evaluation?

LLM-as-judge is good for scaling consistency checks and CI, but it has known biases — preference for longer answers, bias toward outputs from the same model family, and weakness on domain-specific correctness. Treat it as a regression detector, not ground truth. Calibrate it against human eval on your golden set quarterly.
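One way to run that calibration is to score the golden set with both humans and the judge, then measure agreement. The sketch below uses Cohen's kappa, which is a choice made here rather than something this article prescribes; it discounts the agreement you would expect by chance. The verdicts are made up.

```python
def cohens_kappa(human, judge):
    """Chance-corrected agreement between two sets of labels on the same items."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum((human.count(lab) / n) * (judge.count(lab) / n) for lab in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative pass/fail verdicts on ten golden-set examples (1 = pass).
human = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
judge = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohens_kappa(human, judge)
print(f"raw agreement={raw_agreement:.2f} kappa={kappa:.2f}")
```

Raw agreement of 80% shrinks to a kappa of about 0.6 once chance is removed. What counts as "good enough" is a judgment call for your team; tracking the number quarter over quarter is the point.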

How often should evaluation metrics be reviewed?

Offline metrics: every prompt or model change, automatically. Online metrics: daily dashboards, weekly review. The metric definition itself: every quarter, or whenever a customer escalation reveals a failure mode your metric didn't catch. Metrics rot — user behavior, model versions, and input distributions all shift.

What does it cost to build a proper evaluation pipeline?

That depends on the feature, the risk profile, and how much human labeling you need. For a scoped assessment of an eval stack for your specific LLM feature, talk to CodeNicely for a personalized review.
