June 21, 2026 • 7 min read

AI Evaluation Metrics Cheatsheet: Pick the Right One

For: A product-focused CTO at a seed-to-Series-A B2B SaaS company who just shipped their first LLM-powered feature and is being asked by the board whether it is 'working' — and realizes they have no defensible answer

Your board asks if the AI feature is working. You point at an 87% accuracy score. They nod. Six weeks later your largest customer escalates because the assistant confidently invented a refund policy. The metric never moved. It was never going to.

Evaluation metric selection is a product contract decision, not a statistics decision. The metric you pick encodes which failure your users can least afford — a missed case, a wrong answer, a slow answer, or a refusal. Most teams discover they chose wrong after the fact. This is a reference for choosing on purpose.

Step 1: Decide what kind of AI feature you actually shipped

The right metric depends on the task shape, not the model. Find your row:

Feature type | Example | Primary metrics
Classification | Spam detection, ticket routing, intent detection | Precision, Recall, F1, AUC-ROC
Extraction | Pulling fields from invoices, resumes, contracts | Field-level F1, exact-match accuracy
Retrieval (RAG) | Doc search, knowledge assistant | Recall@k, MRR, nDCG, context precision
Generation (open-ended) | Email drafting, summarization, chat | Faithfulness, groundedness, LLM-as-judge, human eval
Generation (constrained) | SQL, JSON, code | Execution accuracy, schema validity, unit-test pass rate
Agentic / multi-step | Workflow automation, tool-using agents | Task success rate, step accuracy, cost per task

Step 2: Precision, recall, F1 — when to use which

The classic confusion is which of the three to optimize. The shortcut: pick the error your users can least afford.

Scenario | Optimize for | Why
Fraud / KYC flagging | Recall (with a precision floor) | Missing a bad actor costs more than reviewing a good one
Auto-responding to support tickets | Precision | A confident wrong answer damages trust
Lead scoring for sales | Precision @ top-k | Sales only works the top of the list
Medical safety check (e.g. drug interactions) | Recall, near 100% | One miss is catastrophic; false alarms are tolerable
Spam filter | Precision (high), recall secondary | Users forgive spam in inbox; they don't forgive losing real mail

For an example of where recall has to be near-perfect because misses are clinically dangerous, see how interaction checking is structured in our HealthPotli e-pharmacy build.
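To make the definitions concrete, here is a minimal sketch of precision, recall, F1, and precision@k in plain Python. The labels and scores are invented for illustration, and the function names are ours, not from any library; in practice you would likely reach for an established metrics package instead.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Binary precision, recall, and F1, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def precision_at_k(y_true: Sequence[int], scores: Sequence[float], k: int) -> float:
    """Share of true positives among the k highest-scored items."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top) if top else 0.0


# Hypothetical data: 1 = qualified lead, 0 = not qualified.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]                    # thresholded predictions
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]   # raw model scores

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
p_at_2 = precision_at_k(y_true, scores, k=2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} p@2={p_at_2:.2f}")
```

On this toy data the thresholded classifier looks mediocre (precision and recall both 2 of 3), while the top two scored leads are both real. That is the case for precision @ top-k when sales only works the top of the list.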

Step 3: Metrics for LLM-generated text

BLEU was built for machine translation and ROUGE for summarization, both scored against fixed reference texts. They are weak proxies for whether a generated answer is correct, grounded, or useful. Stop reporting them to your board.
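A small sketch shows why n-gram overlap misleads. The function below is a simplified, ROUGE-1-style unigram F1 (whitespace tokens, no stemming), not the official ROUGE implementation, and the refund-policy sentences are made up for the example.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical policy text and two candidate answers.
reference = "refunds are available within 30 days of purchase"
wrong = "refunds are available within 90 days of purchase"       # factually wrong
right = "you can get your money back for 30 days after buying"   # correct paraphrase

wrong_score = unigram_f1(wrong, reference)
right_score = unigram_f1(right, reference)
print(f"wrong answer: {wrong_score:.3f}  correct paraphrase: {right_score:.3f}")
```

The answer that invents a 90-day policy scores 0.875 because it differs from the reference by one token. The correct paraphrase scores about 0.211. Overlap rewards wording, not truth.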

Metric | What it measures | Good for | Bad at
BLEU / ROUGE | n-gram overlap with reference | Translation, tight summarization | Open-ended generation, factuality
BERTScore | Semantic similarity to reference | Paraphrase quality | Hallucinations that sound right
Faithfulness / Groundedness | Does output stay within retrieved context | RAG, summarization | Tasks with no source doc
Answer relevance | Does output address the question | QA, chat | Detecting subtle wrongness
LLM-as-judge | A stronger model scores outputs on a rubric | Scaling human eval | Bias toward verbose answers, model family bias
Human eval | SMEs rate on a rubric | Ground truth | Cost, throughput, inter-rater drift

Practical setup: human eval on a frozen 100–300 example golden set, LLM-as-judge for CI on every pull request, faithfulness and answer-relevance tracked in production traffic.

Step 4: RAG-specific metrics

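RAG fails in two places: retrieval (the right passages never reach the model) and generation (the model misuses the passages it was given). Diagnose them separately.

If faithfulness is high but users still complain, the answers are loyal to the wrong context, so your retrieval is bad. If retrieval recall is high but answers are wrong, your generator is hallucinating or your prompt is giving it permission to.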
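For the retrieval half, Recall@k and MRR are easy to compute once you have labeled which documents are relevant to each query. The sketch below assumes a hypothetical log format (a ranked list of document IDs plus a set of relevant IDs per query); the IDs are invented.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical retrieval logs for three queries.
queries = [
    {"retrieved": ["d3", "d1", "d7"], "relevant": {"d1"}},
    {"retrieved": ["d2", "d9", "d4"], "relevant": {"d4", "d5"}},
    {"retrieved": ["d8", "d6", "d0"], "relevant": {"d5"}},
]

mean_recall_at_3 = sum(recall_at_k(q["retrieved"], q["relevant"], 3) for q in queries) / len(queries)
mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in queries) / len(queries)
print(f"Recall@3={mean_recall_at_3:.2f} MRR={mrr:.2f}")
```

If these numbers are low, fix retrieval before touching the prompt. No amount of prompt work recovers a passage the generator never saw.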

Step 5: The metrics nobody tracks until production breaks them

Metric | Why it matters
p50 / p95 latency | A correct answer after 12 seconds is a churned user
Cost per successful task | Accuracy at any price is not a product
Refusal / abstention rate | Models that refuse too often look broken; too rarely, unsafe
Tool-call success rate | For agents, the model can be right and the agent still fail
User-side signals | Thumbs up/down, copy rate, edit-distance on drafted text, retry rate
Drift | Distribution shift in inputs over time; your metric can hold while reality moves
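Three of these fall straight out of call logs. The sketch below assumes a hypothetical log record with `latency_s`, `cost_usd`, and `outcome` fields, and uses a nearest-rank percentile; your logging schema and percentile method may differ.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical call log: ten calls, eight successes, one refusal, one failure.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 4.0, 12.0]
outcomes = ["ok"] * 8 + ["refused", "failed"]
calls = [{"latency_s": l, "cost_usd": 0.01, "outcome": o} for l, o in zip(latencies, outcomes)]

p50 = percentile([c["latency_s"] for c in calls], 50)
p95 = percentile([c["latency_s"] for c in calls], 95)
successes = sum(1 for c in calls if c["outcome"] == "ok")
total_cost = sum(c["cost_usd"] for c in calls)
# Failed and refused calls still cost money, so divide total spend by successes only.
cost_per_success = total_cost / successes if successes else float("inf")
refusal_rate = sum(1 for c in calls if c["outcome"] == "refused") / len(calls)

print(f"p50={p50}s p95={p95}s cost/success=${cost_per_success:.4f} refusal={refusal_rate:.0%}")
```

Here the median looks healthy at 1.2 seconds while p95 is the 12-second call. Report both; the median alone hides the users who churn.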

Step 6: Tie the metric to a business outcome

The trap: optimizing a metric that doesn't move the number the feature was funded to move. Build a two-column doc before you instrument anything.

Business outcome | Eval metric that should track it
Support deflection rate | Answer relevance + faithfulness + thumbs-up rate
Sales cycle compression | Precision@10 of lead scoring + rep acceptance rate
Underwriting approval throughput | Recall on risky applications + manual-review rate
Time-to-first-draft | p95 latency + edit-distance on accepted drafts

If you cannot draw a line from the eval metric to a number a revenue leader cares about, you are measuring model behavior, not product value. For domain-heavy features — credit decisions, logistics routing, financial reconciliation — the metric has to encode the cost of each error type. See how this plays out in credit scoring at Cashpo or route optimization at Vahak.
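One way to encode the cost of each error type is to price false positives and false negatives and report expected cost per decision. The sketch below is an illustration under assumed numbers: the dollar costs and both models' predictions are hypothetical, and real costs have to come from your finance or risk team.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def expected_cost_per_decision(y_true: Sequence[int], y_pred: Sequence[int],
                               cost_fp: float, cost_fn: float) -> float:
    """Average cost per decision, given a price for each error type."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (fp * cost_fp + fn * cost_fn) / len(y_true)


# Hypothetical underwriting data: 1 = risky application, 0 = safe.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses two risky applications
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # flags two safe applications

# Assumed costs: a missed risky applicant costs $500, an unnecessary manual review $20.
cost_a = expected_cost_per_decision(y_true, model_a, cost_fp=20, cost_fn=500)
cost_b = expected_cost_per_decision(y_true, model_b, cost_fp=20, cost_fn=500)

print(f"accuracy: A={accuracy(y_true, model_a):.2f} B={accuracy(y_true, model_b):.2f}")
print(f"cost per decision: A=${cost_a:.2f} B=${cost_b:.2f}")
```

Both models are 80% accurate. Under these assumed prices, model A costs $100 per decision and model B costs $4. Accuracy cannot tell them apart; the cost-weighted metric can.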

Step 7: A minimum viable eval stack

  1. Golden set: 100–300 hand-labeled examples. Freeze it. Version it. Add edge cases as they come in from prod.
  2. Offline CI: run the golden set on every prompt or model change. Block merges on regression of your primary metric.
  3. Online metrics: log inputs, outputs, latency, cost, and a user signal (thumbs, accept/reject, edit distance) on every call.
  4. Periodic human review: sample 50 production traces per week. Score on a rubric. Watch for drift.
  5. One north-star metric per feature, tied to a business outcome. Everything else is a guardrail.
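Steps 1 and 2 can be sketched as a small regression gate. Everything here is a stand-in: the golden set, the `stub_model` function in place of a real model call, exact match as the scorer, and the 0.02 tolerance are all assumptions to replace with your own.

```python
from typing import Callable, Dict, List


def evaluate(golden_set: List[Dict[str, str]],
             generate: Callable[[str], str],
             score: Callable[[str, str], float]) -> float:
    """Mean score of the system under test across the golden set."""
    scores = [score(generate(ex["input"]), ex["expected"]) for ex in golden_set]
    return sum(scores) / len(scores)


def ci_exit_code(current: float, baseline: float, tolerance: float = 0.02) -> int:
    """0 if the primary metric held within tolerance, 1 if it regressed."""
    return 0 if current >= baseline - tolerance else 1


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Hypothetical frozen golden set for a ticket-routing feature.
golden_set = [
    {"input": "My card was declined", "expected": "billing"},
    {"input": "I can't log in", "expected": "account"},
    {"input": "The export button crashes", "expected": "bug"},
    {"input": "Can I get an invoice copy?", "expected": "billing"},
]

# Stub standing in for the real prompt + model call; it gets one example wrong.
canned = {
    "My card was declined": "billing",
    "I can't log in": "account",
    "The export button crashes": "bug",
    "Can I get an invoice copy?": "account",
}


def stub_model(text: str) -> str:
    return canned.get(text, "")


current = evaluate(golden_set, stub_model, exact_match)
exit_code = ci_exit_code(current, baseline=0.80)
print(f"primary metric={current:.2f} exit_code={exit_code}")
# In a CI job you would finish with sys.exit(exit_code) so a regression blocks the merge.
```

The stored baseline is the last accepted score on the same frozen set. When you add edge cases to the golden set, re-baseline in the same change so the gate compares like with like.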

What the cheatsheet won't fix

Metrics don't tell you what your users actually wanted. They tell you whether the thing you built does the thing you said it would. If the spec was wrong, every metric will look fine while NPS drops. Pair evals with qualitative review of real conversations, every week, forever.

Frequently Asked Questions

Is accuracy ever the right metric for an LLM feature?

Rarely on its own. Accuracy is only informative when classes are balanced and error costs are symmetric, and that is almost never the case in production. For classification with imbalanced data, use F1 or precision/recall at a fixed threshold. For generation, accuracy isn't well-defined; use faithfulness, answer relevance, or task success.
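A quick illustration with made-up numbers: on a dataset where 5% of cases are positive, a model that never flags anything still scores 95% accuracy.

```python
# Hypothetical imbalanced dataset: 5 positives (e.g. fraud) out of 100 cases.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # a "model" that always predicts the negative class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy reads 0.95 and recall reads 0.00. The model catches none of the cases the feature exists to catch.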

How do I evaluate an AI feature when I don't have labeled data yet?

Start with LLM-as-judge on a rubric tied to your product contract (e.g. "is this answer grounded in the provided document? yes/no"), and sample 50–100 outputs for human review weekly. Use those reviews to build a golden set incrementally. You don't need thousands of labels to catch regressions — you need a frozen, representative few hundred.

When should I trust LLM-as-judge versus human evaluation?

LLM-as-judge is good for scaling consistency checks and CI, but it has known biases — preference for longer answers, bias toward outputs from the same model family, and weakness on domain-specific correctness. Treat it as a regression detector, not ground truth. Calibrate it against human eval on your golden set quarterly.
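One common way to run that calibration is an agreement statistic between judge and human labels on the same examples. The sketch below uses Cohen's kappa, which corrects raw agreement for chance; the choice of kappa and the labels shown are our assumptions, not a prescribed method.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on the same ten golden-set examples.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohens_kappa(human, judge)
print(f"raw agreement={raw_agreement:.2f} kappa={kappa:.2f}")
```

Raw agreement here is 0.80, but kappa is about 0.58 once chance agreement is removed. Track the number each quarter; a falling kappa means the judge and your SMEs have drifted apart, and the judge's CI verdicts deserve less trust.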

How often should evaluation metrics be reviewed?

Offline metrics: every prompt or model change, automatically. Online metrics: daily dashboards, weekly review. The metric definition itself: every quarter, or whenever a customer escalation reveals a failure mode your metric didn't catch. Metrics rot — user behavior, model versions, and input distributions all shift.

What does it cost to build a proper evaluation pipeline?

That depends on the feature, the risk profile, and how much human labeling you need. For a scoped assessment of an eval stack for your specific LLM feature, talk to CodeNicely for a personalized review.
