AI Observability Stack: What to Monitor and When
For: A CTO at a Series B SaaS company who shipped an AI feature six months ago, has no model failures on the dashboard, but is fielding increasing user complaints about output quality — and realizes they have no systematic way to know if the model is silently degrading
Your dashboards are green. Latency is fine, error rate is under 0.5%, the model returns 200s. And yet support tickets keep coming in: "the answer is wrong," "it made something up," "this used to work better." The gap between those two realities is where AI observability lives — and where most teams who shipped an LLM feature six months ago are flying blind right now.
This is a reference for what to monitor, why traditional APM misses it, and when each signal matters.
Why traditional APM is structurally blind to AI failure
Standard observability assumes a deterministic system. AI features are probabilistic. The output can be wrong while every infrastructure metric is healthy.
| APM tells you | APM cannot tell you |
|---|---|
| The model responded in 800ms | Whether the response was correct |
| Zero 5xx errors | Hallucination rate over the last 24 hours |
| Token throughput is stable | Whether inputs have drifted from training distribution |
| The prompt template loaded | Whether your last prompt edit regressed quality on edge cases |
| The vector DB is up | Whether retrieval is returning relevant chunks |
The four AI-specific failure modes you need to instrument
1. Drift (input and output)
Input drift: the distribution of user prompts has shifted away from what you tested or fine-tuned on.
Output drift: the model's responses have shifted in length, tone, structure, or sentiment over time.
- Track embedding distributions weekly. Compare the current week to a baseline window using KL divergence or Wasserstein distance (a minimal sketch follows this list).
- Bucket by feature/endpoint, not globally. Drift hides in subpopulations.
- Watch output length distribution as a cheap canary — sudden shifts often precede quality regressions.
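Here's what the weekly drift check might look like, as a minimal sketch: it assumes you already store a prompt embedding per request, and `load_embeddings`, `alert`, and the threshold are hypothetical stand-ins for your own storage layer, alerting, and empirical baseline.

```python
# Per-dimension Wasserstein drift between a baseline window and the current
# week of prompt embeddings. load_embeddings() is a hypothetical stand-in
# for your own storage layer.
import numpy as np
from scipy.stats import wasserstein_distance

def drift_score(baseline: np.ndarray, current: np.ndarray) -> float:
    """Mean per-dimension Wasserstein distance between two embedding samples."""
    return float(np.mean([
        wasserstein_distance(baseline[:, d], current[:, d])
        for d in range(baseline.shape[1])
    ]))

# Usage sketch:
# baseline = load_embeddings("known-good-week")   # hypothetical loader
# current  = load_embeddings("this-week")
# if drift_score(baseline, current) > EMPIRICAL_THRESHOLD:  # set from history
#     alert("input drift above threshold")
```

Per-dimension distances are a blunt instrument, but they're cheap and catch gross shifts, which is exactly what you want from a weekly canary.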
2. Hallucination rate
- For RAG: measure groundedness — what percentage of claims in the output are supported by retrieved context. Tools: Ragas, TruLens, or an LLM-as-judge with a rubric.
- For non-RAG: sample 1–2% of traffic for offline judge scoring. Don't try to score 100% in real time; the cost balloons. (A sampling sketch follows this list.)
- Set a threshold (e.g. groundedness < 0.85) and alert on the rolling 24h rate.
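A minimal sketch of the sampling pattern, using an in-process queue for illustration; `judge_groundedness` and `record_metric` are hypothetical stand-ins for your judge (Ragas, TruLens, or a rubric prompt) and your metrics sink, and a production setup would use a durable queue instead of `queue.Queue`.

```python
# Sampled, async groundedness scoring: cheap check on the hot path,
# judging happens off the request path entirely.
import queue
import random
import threading

SAMPLE_RATE = 0.02  # score ~2% of traffic, offline
scoring_queue: queue.Queue = queue.Queue()

def maybe_sample(request_id: str, output: str, context: str) -> None:
    """Call on the hot path: non-blocking, adds no inline judge latency."""
    if random.random() < SAMPLE_RATE:
        scoring_queue.put({"id": request_id, "output": output, "context": context})

def scoring_worker() -> None:
    """Runs off the request path; feeds the rolling-24h groundedness rate."""
    while True:
        item = scoring_queue.get()
        score = judge_groundedness(item["output"], item["context"])  # hypothetical judge
        record_metric("groundedness", score, request_id=item["id"])  # hypothetical sink

threading.Thread(target=scoring_worker, daemon=True).start()
```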
3. Prompt regression
Every prompt edit is a deploy. Treat it that way.
- Maintain an eval set of 50–500 examples per critical prompt. Run it on every prompt change before merging (a CI gate is sketched after this list).
- Track per-version metrics: accuracy, format adherence, refusal rate, latency, token cost.
- Tools worth looking at: PromptLayer, Langfuse, Helicone, Braintrust, Arize Phoenix.
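Any of those tools can run this gate for you, but the shape of it is simple enough to sketch; `run_prompt`, `meets_expectation`, the eval path, and the 0.90 threshold are all assumptions to replace with your own harness.

```python
# CI gate for prompt changes: run the eval set against the new prompt
# version; a nonzero exit fails the build. run_prompt() and
# meets_expectation() are hypothetical hooks into your model call and grader.
import json
import sys

def run_eval(prompt_version: str, eval_path: str = "evals/critical_prompt.jsonl") -> float:
    passed = total = 0
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)                             # {"input": ..., "expected": ...}
            output = run_prompt(prompt_version, case["input"])  # hypothetical model call
            passed += int(meets_expectation(output, case["expected"]))  # hypothetical grader
            total += 1
    return passed / total

if __name__ == "__main__":
    accuracy = run_eval(sys.argv[1])
    print(f"eval accuracy: {accuracy:.2%}")
    sys.exit(0 if accuracy >= 0.90 else 1)  # illustrative threshold
```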
4. Feedback loop poisoning
If you fine-tune on user interactions or use thumbs-up/down to weight retrieval, bad signal compounds.
- Audit the training set before each fine-tune cycle. Check for adversarial inputs, duplicates, and class imbalance.
- Hold out a clean, frozen eval set the production data never touches.
- Track the win rate of the new model vs. the previous one on that frozen set. If it doesn't beat the predecessor, don't ship (a minimal gate is sketched below).
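A minimal win-rate gate, sketched under the assumption of a pairwise preference judge; `pairwise_judge` is hypothetical and returns "new", "old", or "tie" for each eval case.

```python
# Ship gate: the candidate model must beat its predecessor on the frozen
# eval set. pairwise_judge() is a hypothetical preference judge.
def win_rate(frozen_set: list, old_model, new_model) -> float:
    wins = ties = 0
    for case in frozen_set:
        verdict = pairwise_judge(case, old_model, new_model)  # "new" | "old" | "tie"
        wins += verdict == "new"
        ties += verdict == "tie"
    return (wins + 0.5 * ties) / len(frozen_set)  # ties count as half a win

# if win_rate(frozen_set, prev_model, candidate) <= 0.5:
#     block the release
```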
The monitoring checklist by layer
| Layer | What to monitor | Frequency | Alert threshold |
|---|---|---|---|
| Infra | Latency p50/p95/p99, error rate, token throughput, rate limits | Real-time | Standard SRE thresholds |
| Cost | Tokens per request, cost per user, cache hit rate | Hourly | >20% week-over-week increase |
| Retrieval (RAG) | Recall@k, chunk relevance score, empty retrieval rate | Real-time | Empty retrievals >2% |
| Model output | Groundedness, format adherence, refusal rate, output length distribution | Sampled, hourly aggregate | Groundedness <0.85, refusal spike >2x baseline |
| Drift | Input embedding distribution vs. baseline | Daily | KL divergence above empirical threshold |
| User signal | Thumbs up/down rate, regenerate rate, copy rate, session abandonment | Real-time | Thumbs-down >1.5x rolling 7d |
| Safety | PII leakage, jailbreak attempts, toxicity flags | Real-time | Any confirmed leak = page |
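To make one of those thresholds concrete, here's the user-signal row's thumbs-down alert (>1.5x the rolling 7-day baseline) as a sketch, assuming an hourly rate series indexed by timestamp; the names are illustrative.

```python
# Alert when the latest hourly thumbs-down rate exceeds 1.5x the rolling
# 7-day baseline. Assumes `rates` is an hourly pd.Series with a sorted
# DatetimeIndex.
import pandas as pd

def thumbs_down_alert(rates: pd.Series, factor: float = 1.5) -> bool:
    baseline = rates.rolling("7D").mean().shift(1)  # exclude the current point
    return bool(rates.iloc[-1] > factor * baseline.iloc[-1])
```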
When to add each signal
| Stage | Add these | Skip these for now |
|---|---|---|
| Pre-launch | Eval set, prompt version control, latency/error, basic safety filters | Drift detection, fine-tune feedback loops |
| 0–10k users | Sampled groundedness scoring, thumbs feedback, cost per user | Embedding drift dashboards |
| 10k+ users / multiple prompts | Drift detection, regression alerts, retrieval recall, judge model in CI | — |
| Post fine-tune | Frozen eval set, win-rate gating, training data audits | — |
Tooling: pick one stack, not five
- All-in-one LLM observability: Langfuse (open source), Arize Phoenix, Helicone, LangSmith, Braintrust.
- Eval frameworks: Ragas (RAG-specific), DeepEval, promptfoo, OpenAI Evals.
- Drift / ML monitoring: Evidently, WhyLabs, Arize.
- Tracing: the OpenTelemetry GenAI semantic conventions are stabilizing; instrument with OTel if you want vendor portability (a minimal sketch follows).
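If you go the OTel route, the span-level instrumentation might look like this; the `gen_ai.*` attribute names follow the incubating GenAI semantic conventions and may still change, and `call_model` is a hypothetical provider call.

```python
# Minimal OTel span around an LLM call, using GenAI semconv attribute names
# (check the current spec before relying on them). call_model() is a
# hypothetical stand-in for your provider SDK.
from opentelemetry import trace

tracer = trace.get_tracer("ai-feature")

def traced_completion(prompt: str) -> str:
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.request.model", "gpt-4o")  # illustrative model name
        response = call_model(prompt)                          # hypothetical provider call
        span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)
        return response.text
```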
Honest tradeoff: judge models cost real money and add latency if run inline. Sample. Run async. Don't try to score every request.
What this approach is bad at
- LLM-as-judge has its own bias and variance. Calibrate against human review periodically — don't trust it blindly.
- Drift metrics tell you something changed, not whether the change is bad. You still need humans in the loop for ambiguous cases.
- Eval sets go stale. Budget time to refresh them quarterly, especially as user behavior shifts.
- Groundedness scoring on long contexts is expensive and noisy. Chunk it.
How CodeNicely can help
If you're at the stage where the dashboard is green but the complaints are real, you need someone who has built this instrumentation for regulated, high-stakes output, not a chatbot demo. Our work on HealthPotli involved an AI drug-interaction layer where a wrong answer is not a UX bug; it's a safety event. That engagement forced us to build sampled groundedness scoring, frozen eval sets, and drift alerts before the product could go live, the same pattern a Series B SaaS team needs once user trust is on the line.
If your AI feature lives in a fintech or workflow context, the GimBooks work is the closer analog: structured-output reliability, prompt versioning, and regression gating on every change. Either way, see CodeNicely AI Studio for how we typically scope an observability retrofit on an existing AI feature.
Frequently Asked Questions
What is AI observability and how is it different from APM?
AI observability monitors the quality and behavior of model outputs — drift, hallucination, groundedness, prompt regression — in addition to standard infrastructure metrics. APM tells you the model responded; AI observability tells you whether the response was correct. Both are required; neither replaces the other.
How do I detect LLM drift in production?
Compute embeddings of incoming prompts, store a baseline distribution from a known-good window, and compare current windows using KL divergence or Wasserstein distance. Bucket by endpoint or user segment, not globally. Pair this with output-side signals like length distribution and refusal rate, which often shift before users notice quality issues.
What's the minimum AI monitoring checklist for a feature already in production?
At minimum: a versioned eval set run on every prompt change, sampled groundedness or correctness scoring on 1–2% of live traffic, user feedback signals (thumbs, regenerate, copy), token cost per request, and weekly input drift checks. Add retrieval recall metrics if you're using RAG. Skip drift dashboards until you have enough volume for them to be meaningful.
Should I use LLM-as-judge for production monitoring?
Yes, but sampled and async — not inline on every request. Inline judging doubles your cost and latency. Sample 1–2% of traffic, score it offline, and aggregate. Calibrate the judge model against periodic human review so you know its bias and false-positive rate.
How much does it cost to retrofit observability onto an existing AI feature?
It depends on your current stack, traffic volume, and how regulated the output is. Contact CodeNicely for a personalized assessment — we'll scope it against your actual architecture rather than a generic estimate.
Building something in SaaS?
CodeNicely partners with founders and tech teams to ship AI-native products that move metrics. Tell us about the problem you're solving.
Talk to our team.