
AI Observability Stack: What to Monitor and When

For: A CTO at a Series B SaaS company who shipped an AI feature six months ago, has no model failures on the dashboard, but is fielding increasing user complaints about output quality — and realizes they have no systematic way to know if the model is silently degrading

Your dashboards are green. Latency is fine, error rate is under 0.5%, the model returns 200s. And yet support tickets keep coming in: "the answer is wrong," "it made something up," "this used to work better." The gap between those two realities is where AI observability lives — and where most teams who shipped an LLM feature six months ago are flying blind right now.

This is a reference for what to monitor, why traditional APM misses it, and when each signal matters.

Why traditional APM is structurally blind to AI failure

Standard observability assumes a deterministic system. AI features are probabilistic. The output can be wrong while every infrastructure metric is healthy.

| APM tells you | APM cannot tell you |
| --- | --- |
| The model responded in 800ms | Whether the response was correct |
| Zero 5xx errors | Hallucination rate over the last 24 hours |
| Token throughput is stable | Whether inputs have drifted from the training distribution |
| The prompt template loaded | Whether your last prompt edit regressed quality on edge cases |
| The vector DB is up | Whether retrieval is returning relevant chunks |

The four AI-specific failure modes you need to instrument

1. Drift (input and output)

Input drift: the distribution of user prompts has shifted away from what you tested or fine-tuned on.
Output drift: the model's responses have shifted in length, tone, structure, or sentiment over time.
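
A minimal sketch of the input-drift side, assuming you already log prompt embeddings. The 1-D projection, bin count, and 0.15 threshold here are illustrative stand-ins, not recommended settings; the table below says to pick the threshold empirically.

```python
# Minimal input-drift check: compare the embedding distribution of recent
# prompts against a frozen baseline window using KL divergence.
import numpy as np
from scipy.stats import entropy

def drift_score(baseline: np.ndarray, recent: np.ndarray, bins: int = 50) -> float:
    """KL divergence between histograms of a crude 1-D projection of embeddings."""
    # Project each embedding to a scalar (mean across dimensions) so we can
    # histogram cheaply; swap in a better projection if you have one.
    b, r = baseline.mean(axis=1), recent.mean(axis=1)
    lo, hi = min(b.min(), r.min()), max(b.max(), r.max())
    p, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(r, bins=bins, range=(lo, hi), density=True)
    # Small epsilon so empty bins don't blow up the divergence.
    return float(entropy(p + 1e-9, q + 1e-9))

# Synthetic stand-ins for the embeddings you already collect.
rng = np.random.default_rng(0)
baseline_embeddings = rng.normal(0.0, 1.0, size=(5000, 768))
recent_embeddings = rng.normal(0.3, 1.2, size=(1000, 768))  # shifted traffic

# Daily job: alert when divergence exceeds your empirically chosen threshold.
if drift_score(baseline_embeddings, recent_embeddings) > 0.15:
    print("Input drift above threshold -- review recent prompt traffic")
```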

2. Hallucination rate

The share of responses that assert claims not supported by the provided context or retrieved sources. Measure it with sampled groundedness or correctness scoring rather than trying to check every request inline.

3. Prompt regression

Every prompt edit is a deploy. Treat it that way.
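
A minimal sketch of what "treat it like a deploy" can look like in CI, assuming you keep a versioned prompt file and a frozen eval set. The `call_model` stub, file paths, and 2-point tolerance are illustrative, not a specific tool's API.

```python
# CI gate for prompt changes: run the same frozen eval set against the old
# and new prompt versions, and fail the build on a meaningful score drop.
import json
import sys

def call_model(template: str, case: dict) -> str:
    # Replace with your real model client; stubbed so the gate logic is clear.
    return template.format(**case["inputs"])

def score(expected: str, actual: str) -> float:
    # Crude exact-match scoring; swap in a rubric or judge-model scorer.
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_eval(template_path: str, eval_path: str) -> float:
    template = open(template_path).read()
    cases = [json.loads(line) for line in open(eval_path)]
    return sum(score(c["expected"], call_model(template, c)) for c in cases) / len(cases)

if __name__ == "__main__":
    baseline = run_eval("prompts/v12.txt", "evals/frozen_set.jsonl")
    candidate = run_eval("prompts/v13.txt", "evals/frozen_set.jsonl")
    if candidate < baseline - 0.02:  # tolerance; tune to your eval set size
        sys.exit(f"Prompt regression: {candidate:.3f} vs baseline {baseline:.3f}")
    print(f"OK: {candidate:.3f} vs baseline {baseline:.3f}")
```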

4. Feedback loop poisoning

If you fine-tune on user interactions or use thumbs-up/down to weight retrieval, bad signal compounds.
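
One hedged example of how to guard that pipeline, assuming thumbs events flow into fine-tuning or retrieval weighting. The field names, per-user cap, and minimum volume are illustrative choices, not a prescription.

```python
# Filter raw user-feedback events before they influence fine-tuning or
# retrieval weights: drop flagged sessions, cap any single user's influence,
# and require a minimum volume so a handful of accounts can't steer the model.
from collections import Counter

def clean_feedback(events: list[dict], max_per_user: int = 20, min_total: int = 500) -> list[dict]:
    """events: [{'user_id': ..., 'label': 'up'|'down', 'safety_flag': bool}, ...]"""
    kept, per_user = [], Counter()
    for e in events:
        if e.get("safety_flag"):                     # never learn from flagged sessions
            continue
        if per_user[e["user_id"]] >= max_per_user:   # cap per-user weight
            continue
        per_user[e["user_id"]] += 1
        kept.append(e)
    # Below this volume the signal is too noisy to act on at all.
    return kept if len(kept) >= min_total else []
```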

The monitoring checklist by layer

| Layer | What to monitor | Frequency | Alert threshold |
| --- | --- | --- | --- |
| Infra | Latency p50/p95/p99, error rate, token throughput, rate limits | Real-time | Standard SRE thresholds |
| Cost | Tokens per request, cost per user, cache hit rate | Hourly | >20% week-over-week increase |
| Retrieval (RAG) | Recall@k, chunk relevance score, empty retrieval rate | Real-time | Empty retrievals >2% |
| Model output | Groundedness, format adherence, refusal rate, output length distribution | Sampled, hourly aggregate | Groundedness <0.85, refusal spike >2x baseline |
| Drift | Input embedding distribution vs. baseline | Daily | KL divergence above empirical threshold |
| User signal | Thumbs up/down rate, regenerate rate, copy rate, session abandonment | Real-time | Thumbs-down >1.5x rolling 7d |
| Safety | PII leakage, jailbreak attempts, toxicity flags | Real-time | Any confirmed leak = page |
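
To make one of those thresholds concrete, here is a sketch of the user-signal alert: thumbs-down rate against a rolling seven-day baseline. The 1.5x multiplier comes from the table; everything else is illustrative.

```python
# Alert when today's thumbs-down rate exceeds 1.5x the rolling 7-day baseline.
def thumbs_down_alert(daily_counts: list[tuple[int, int]], multiplier: float = 1.5) -> bool:
    """daily_counts: last 8 days of (thumbs_down, total_feedback), oldest first."""
    history, today = daily_counts[:-1], daily_counts[-1]
    baseline_rate = sum(d for d, _ in history) / max(sum(t for _, t in history), 1)
    today_rate = today[0] / max(today[1], 1)
    return today_rate > multiplier * baseline_rate

# Example: roughly 5% baseline, 11% today -> alert fires.
counts = [(12, 240), (9, 210), (14, 260), (11, 230), (10, 220), (13, 250), (12, 240), (28, 250)]
print(thumbs_down_alert(counts))  # True
```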

When to add each signal

| Stage | Add these | Skip these for now |
| --- | --- | --- |
| Pre-launch | Eval set, prompt version control, latency/error, basic safety filters | Drift detection, fine-tune feedback loops |
| 0–10k users | Sampled groundedness scoring, thumbs feedback, cost per user | Embedding drift dashboards |
| 10k+ users / multiple prompts | Drift detection, regression alerts, retrieval recall, judge model in CI | |
| Post fine-tune | Frozen eval set, win-rate gating, training data audits | |

Tooling: pick one stack, not five

Honest tradeoff: judge models cost real money and add latency if run inline. Sample. Run async. Don't try to score every request.
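
Here is roughly what that sampling pattern looks like. The `judge` call and the 2% rate below are placeholders for whatever scoring model and budget you actually run; the point is that judging never sits on the request path.

```python
# Sample ~2% of production responses and score them asynchronously with a
# judge model, so judging adds no latency or cost to the user request.
import asyncio
import random

SAMPLE_RATE = 0.02
queue: asyncio.Queue = asyncio.Queue()

def maybe_enqueue(prompt: str, response: str, context: str) -> None:
    """Called on the request path: O(1), never blocks the user."""
    if random.random() < SAMPLE_RATE:
        queue.put_nowait({"prompt": prompt, "response": response, "context": context})

async def judge(item: dict) -> float:
    # Placeholder: call your judge model and return a 0-1 groundedness score.
    await asyncio.sleep(0.1)
    return 1.0

async def judge_worker(scores: list[float]) -> None:
    """Background consumer: drains the queue and aggregates scores for hourly reporting."""
    while True:
        item = await queue.get()
        scores.append(await judge(item))
        queue.task_done()
```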

What this approach is bad at

Sampling misses rare failures, judge models carry their own bias and need periodic calibration against human review, and drift metrics stay noisy until you have enough traffic for the baselines to mean anything.

How CodeNicely can help

If you're at the stage where the dashboard is green but the complaints are real, you need someone who has built this instrumentation against a regulated, high-stakes output — not a chatbot demo. Our work on HealthPotli involved an AI drug-interaction layer where a wrong answer is not a UX bug, it's a safety event. That engagement forced us to build sampled groundedness scoring, frozen eval sets, and drift alerts before the product could go live — the same pattern a Series B SaaS team needs once user trust is on the line.

If your AI feature is closer to a fintech or workflow context, the GimBooks work is a closer analog: structured-output reliability, prompt versioning, and regression gating on every change. Either way, see CodeNicely AI Studio for how we typically scope an observability retrofit on an existing AI feature.

Frequently Asked Questions

What is AI observability and how is it different from APM?

AI observability monitors the quality and behavior of model outputs — drift, hallucination, groundedness, prompt regression — in addition to standard infrastructure metrics. APM tells you the model responded; AI observability tells you whether the response was correct. Both are required; neither replaces the other.

How do I detect LLM drift in production?

Compute embeddings of incoming prompts, store a baseline distribution from a known-good window, and compare current windows using KL divergence or Wasserstein distance. Bucket by endpoint or user segment, not globally. Pair this with output-side signals like length distribution and refusal rate, which often shift before users notice quality issues.

What's the minimum AI monitoring checklist for a feature already in production?

At minimum: a versioned eval set run on every prompt change, sampled groundedness or correctness scoring on 1–2% of live traffic, user feedback signals (thumbs, regenerate, copy), token cost per request, and weekly input drift checks. Add retrieval recall metrics if you're using RAG. Skip drift dashboards until you have enough volume for them to be meaningful.

Should I use LLM-as-judge for production monitoring?

Yes, but sampled and async — not inline on every request. Inline judging doubles your cost and latency. Sample 1–2% of traffic, score it offline, and aggregate. Calibrate the judge model against periodic human review so you know its bias and false-positive rate.

How much does it cost to retrofit observability onto an existing AI feature?

It depends on your current stack, traffic volume, and how regulated the output is. Contact CodeNicely for a personalized assessment — we'll scope it against your actual architecture rather than a generic estimate.

Building something in SaaS?

CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.

Talk to our team