AI Observability Stack: What to Monitor and When
For: A CTO at a Series B SaaS company who shipped an AI feature six months ago, has no model failures on the dashboard, but is fielding increasing user complaints about output quality — and realizes they have no systematic way to know if the model is silently degrading
Your dashboards are green. Latency is fine, error rate is under 0.5%, the model returns 200s. And yet support tickets keep coming in: "the answer is wrong," "it made something up," "this used to work better." The gap between those two realities is where AI observability lives — and where most teams who shipped an LLM feature six months ago are flying blind right now.
This is a reference for what to monitor, why traditional APM misses it, and when each signal matters.
Why traditional APM is structurally blind to AI failure
Standard observability assumes a deterministic system. AI features are probabilistic. The output can be wrong while every infrastructure metric is healthy.
| APM tells you | APM cannot tell you |
|---|---|
| The model responded in 800ms | Whether the response was correct |
| Zero 5xx errors | Hallucination rate over the last 24 hours |
| Token throughput is stable | Whether inputs have drifted from training distribution |
| The prompt template loaded | Whether your last prompt edit regressed quality on edge cases |
| The vector DB is up | Whether retrieval is returning relevant chunks |
The four AI-specific failure modes you need to instrument
1. Drift (input and output)
Input drift: the distribution of user prompts has shifted away from what you tested or fine-tuned on.
Output drift: the model's responses have shifted in length, tone, structure, or sentiment over time.
- Track embedding distributions weekly. Compare the current week to a baseline window using KL divergence or Wasserstein distance (a minimal sketch follows this list).
- Bucket by feature/endpoint, not globally. Drift hides in subpopulations.
- Watch output length distribution as a cheap canary — sudden shifts often precede quality regressions.
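Here's what the weekly drift check might look like, as a minimal sketch: it assumes you already store a prompt embedding per request, and `load_embeddings`, `alert`, and the threshold are hypothetical stand-ins for your own storage layer, alerting, and empirical baseline.

```python
# Per-dimension Wasserstein drift between a baseline window and the current
# week of prompt embeddings. load_embeddings() is a hypothetical stand-in
# for your own storage layer.
import numpy as np
from scipy.stats import wasserstein_distance

def drift_score(baseline: np.ndarray, current: np.ndarray) -> float:
    """Mean per-dimension Wasserstein distance between two embedding samples."""
    return float(np.mean([
        wasserstein_distance(baseline[:, d], current[:, d])
        for d in range(baseline.shape[1])
    ]))

# Usage sketch:
# baseline = load_embeddings("known-good-week")   # hypothetical loader
# current  = load_embeddings("this-week")
# if drift_score(baseline, current) > EMPIRICAL_THRESHOLD:  # set from history
#     alert("input drift above threshold")
```

Per-dimension distances are a blunt instrument, but they're cheap and catch gross shifts, which is exactly what you want from a weekly canary.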
2. Hallucination rate
- For RAG: measure groundedness — what percentage of claims in the output are supported by retrieved context. Tools: Ragas, TruLens, or an LLM-as-judge with a rubric.
- For non-RAG: sample 1–2% of traffic for offline judge scoring. Don't try to score 100% in real time; the cost balloons. (A sampling sketch follows this list.)
- Set a threshold (e.g. groundedness < 0.85) and alert on the rolling 24h rate.
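A minimal sketch of the sampling pattern, using an in-process queue for illustration; `judge_groundedness` and `record_metric` are hypothetical stand-ins for your judge (Ragas, TruLens, or a rubric prompt) and your metrics sink, and a production setup would use a durable queue instead of `queue.Queue`.

```python
# Sampled, async groundedness scoring: cheap check on the hot path,
# judging happens off the request path entirely.
import queue
import random
import threading

SAMPLE_RATE = 0.02  # score ~2% of traffic, offline
scoring_queue: queue.Queue = queue.Queue()

def maybe_sample(request_id: str, output: str, context: str) -> None:
    """Call on the hot path: non-blocking, adds no inline judge latency."""
    if random.random() < SAMPLE_RATE:
        scoring_queue.put({"id": request_id, "output": output, "context": context})

def scoring_worker() -> None:
    """Runs off the request path; feeds the rolling-24h groundedness rate."""
    while True:
        item = scoring_queue.get()
        score = judge_groundedness(item["output"], item["context"])  # hypothetical judge
        record_metric("groundedness", score, request_id=item["id"])  # hypothetical sink

threading.Thread(target=scoring_worker, daemon=True).start()
```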
3. Prompt regression
Every prompt edit is a deploy. Treat it that way.
- Maintain an eval set of 50–500 examples per critical prompt. Run it on every prompt change before merging (a CI gate is sketched after this list).
- Track per-version metrics: accuracy, format adherence, refusal rate, latency, token cost.
- Tools worth looking at: PromptLayer, Langfuse, Helicone, Braintrust, Arize Phoenix.
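Any of those tools can run this gate for you, but the shape of it is simple enough to sketch; `run_prompt`, `meets_expectation`, the eval path, and the 0.90 threshold are all assumptions to replace with your own harness.

```python
# CI gate for prompt changes: run the eval set against the new prompt
# version; a nonzero exit fails the build. run_prompt() and
# meets_expectation() are hypothetical hooks into your model call and grader.
import json
import sys

def run_eval(prompt_version: str, eval_path: str = "evals/critical_prompt.jsonl") -> float:
    passed = total = 0
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)                             # {"input": ..., "expected": ...}
            output = run_prompt(prompt_version, case["input"])  # hypothetical model call
            passed += int(meets_expectation(output, case["expected"]))  # hypothetical grader
            total += 1
    return passed / total

if __name__ == "__main__":
    accuracy = run_eval(sys.argv[1])
    print(f"eval accuracy: {accuracy:.2%}")
    sys.exit(0 if accuracy >= 0.90 else 1)  # illustrative threshold
```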
4. Feedback loop poisoning
If you fine-tune on user interactions or use thumbs-up/down to weight retrieval, bad signal compounds.
- Audit the training set before each fine-tune cycle. Check for adversarial inputs, duplicates, and class imbalance.
- Hold out a clean, frozen eval set the production data never touches.
- Track the win rate of the new model vs. the previous one on that frozen set. If it doesn't beat the predecessor, don't ship (a minimal gate is sketched below).
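A minimal win-rate gate, sketched under the assumption of a pairwise preference judge; `pairwise_judge` is hypothetical and returns "new", "old", or "tie" for each eval case.

```python
# Ship gate: the candidate model must beat its predecessor on the frozen
# eval set. pairwise_judge() is a hypothetical preference judge.
def win_rate(frozen_set: list, old_model, new_model) -> float:
    wins = ties = 0
    for case in frozen_set:
        verdict = pairwise_judge(case, old_model, new_model)  # "new" | "old" | "tie"
        wins += verdict == "new"
        ties += verdict == "tie"
    return (wins + 0.5 * ties) / len(frozen_set)  # ties count as half a win

# if win_rate(frozen_set, prev_model, candidate) <= 0.5:
#     block the release
```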
The monitoring checklist by layer
| Layer | What to monitor | Frequency | Alert threshold |
|---|---|---|---|
| Infra | Latency p50/p95/p99, error rate, token throughput, rate limits | Real-time | Standard SRE thresholds |
| Cost | Tokens per request, cost per user, cache hit rate | Hourly | >20% week-over-week increase |
| Retrieval (RAG) | Recall@k, chunk relevance score, empty retrieval rate | Real-time | Empty retrievals >2% |
| Model output | Groundedness, format adherence, refusal rate, output length distribution | Sampled, hourly aggregate | Groundedness <0.85, refusal spike >2x baseline |
| Drift | Input embedding distribution vs. baseline | Daily | KL divergence above empirical threshold |
| User signal | Thumbs up/down rate, regenerate rate, copy rate, session abandonment | Real-time | Thumbs-down >1.5x rolling 7d |
| Safety | PII leakage, jailbreak attempts, toxicity flags | Real-time | Any confirmed leak = page |
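To make one of those thresholds concrete, here's the user-signal row's thumbs-down alert (>1.5x the rolling 7-day baseline) as a sketch, assuming an hourly rate series indexed by timestamp; the names are illustrative.

```python
# Alert when the latest hourly thumbs-down rate exceeds 1.5x the rolling
# 7-day baseline. Assumes `rates` is an hourly pd.Series with a sorted
# DatetimeIndex.
import pandas as pd

def thumbs_down_alert(rates: pd.Series, factor: float = 1.5) -> bool:
    baseline = rates.rolling("7D").mean().shift(1)  # exclude the current point
    return bool(rates.iloc[-1] > factor * baseline.iloc[-1])
```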
When to add each signal
| Stage | Add these | Skip these for now |
|---|---|---|
| Pre-launch | Eval set, prompt version control, latency/error, basic safety filters | Drift detection, fine-tune feedback loops |
| 0–10k users | Sampled groundedness scoring, thumbs feedback, cost per user | Embedding drift dashboards |
| 10k+ users / multiple prompts | Drift detection, regression alerts, retrieval recall, judge model in CI | — |
| Post fine-tune | Frozen eval set, win-rate gating, training data audits | — |
Tooling: pick one stack, not five
- All-in-one LLM observability: Langfuse (open source), Arize Phoenix, Helicone, LangSmith, Braintrust.
- Eval frameworks: Ragas (RAG-specific), DeepEval, promptfoo, OpenAI Evals.
- Drift / ML monitoring: Evidently, WhyLabs, Arize.
- Tracing: the OpenTelemetry GenAI semantic conventions are stabilizing; instrument with OTel if you want vendor portability (a minimal sketch follows).
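If you go the OTel route, the span-level instrumentation might look like this; the `gen_ai.*` attribute names follow the incubating GenAI semantic conventions and may still change, and `call_model` is a hypothetical provider call.

```python
# Minimal OTel span around an LLM call, using GenAI semconv attribute names
# (check the current spec before relying on them). call_model() is a
# hypothetical stand-in for your provider SDK.
from opentelemetry import trace

tracer = trace.get_tracer("ai-feature")

def traced_completion(prompt: str) -> str:
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.request.model", "gpt-4o")  # illustrative model name
        response = call_model(prompt)                          # hypothetical provider call
        span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)
        return response.text
```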
Honest tradeoff: judge models cost real money and add latency if run inline. Sample. Run async. Don't try to score every request.
What this approach is bad at
- LLM-as-judge has its own bias and variance. Calibrate against human review periodically — don't trust it blindly.
- Drift metrics tell you something changed, not whether the change is bad. You still need humans in the loop for ambiguous cases.
- Eval sets go stale. Budget time to refresh them quarterly, especially as user behavior shifts.
- Groundedness scoring on long contexts is expensive and noisy. Chunk it.
How CodeNicely can help
If you're at the stage where the dashboard is green but the complaints are real, you need someone who has built this instrumentation for regulated, high-stakes output, not a chatbot demo. Our work on HealthPotli involved an AI drug-interaction layer where a wrong answer is not a UX bug; it's a safety event. That engagement forced us to build sampled groundedness scoring, frozen eval sets, and drift alerts before the product could go live, the same pattern a Series B SaaS team needs once user trust is on the line.
If your AI feature lives in a fintech or workflow context, the GimBooks work is the closer analog: structured-output reliability, prompt versioning, and regression gating on every change. Either way, see CodeNicely AI Studio for how we typically scope an observability retrofit on an existing AI feature.
Frequently Asked Questions
What is AI observability and how is it different from APM?
AI observability monitors the quality and behavior of model outputs — drift, hallucination, groundedness, prompt regression — in addition to standard infrastructure metrics. APM tells you the model responded; AI observability tells you whether the response was correct. Both are required; neither replaces the other.
How do I detect LLM drift in production?
Compute embeddings of incoming prompts, store a baseline distribution from a known-good window, and compare current windows using KL divergence or Wasserstein distance. Bucket by endpoint or user segment, not globally. Pair this with output-side signals like length distribution and refusal rate, which often shift before users notice quality issues.
What's the minimum AI monitoring checklist for a feature already in production?
At minimum: a versioned eval set run on every prompt change, sampled groundedness or correctness scoring on 1–2% of live traffic, user feedback signals (thumbs, regenerate, copy), token cost per request, and weekly input drift checks. Add retrieval recall metrics if you're using RAG. Skip drift dashboards until you have enough volume for them to be meaningful.
Should I use LLM-as-judge for production monitoring?
Yes, but sampled and async — not inline on every request. Inline judging doubles your cost and latency. Sample 1–2% of traffic, score it offline, and aggregate. Calibrate the judge model against periodic human review so you know its bias and false-positive rate.
How much does it cost to retrofit observability onto an existing AI feature?
It depends on your current stack, traffic volume, and how regulated the output is. Contact CodeNicely for a personalized assessment — we'll scope it against your actual architecture rather than a generic estimate.
Building something in SaaS?
CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.
Talk to our team.