LLM Monitoring Explained: Traces, Evals, Alerts, and Debugging

Most LLM monitoring guides stop at metric lists. This one gives you the full failure-to-fix loop: trace, assert, alert, replay.


Neel Sharma

May 31, 2026 · 9 min read

LLM monitoring is the continuous practice of capturing what an LLM application does in production (inputs, outputs, latency, token costs, tool calls), evaluating whether those outputs meet quality and safety thresholds, and triggering actionable alerts when they don't. Observability is the property; monitoring is the practice built on top of it. And unlike traditional APM, LLM monitoring must extend into output quality because a wrong answer returns HTTP 200 just like a right one.

Most explainers hand you a metric taxonomy and stop there. This one walks through the full operational loop: trace every step, assert against every interaction, alert with code-level context, and replay from the exact step that broke. If you want the four-layer architectural model that underpins all of this, our LLM observability deep dive covers that foundation.

Why LLM Monitoring Is Harder Than Traditional Monitoring

LLM monitoring is categorically harder than traditional APM because of three compounding problems: non-determinism, multi-step failure propagation, and silent quality degradation. A crash fires an alert. A confident hallucination does not. Nothing in your existing observability stack has a concept of "wrong answer," which means the entire failure class is invisible by default.

Non-determinism is the first wall teams hit. As Sendbird notes, AI agents' outputs are inherently non-deterministic, so the same input can produce different outputs at different times. That makes rule-based detection nearly useless for quality failures.

Multi-step propagation is harder. As one analysis of production AI agents describes it, an agent can complete a task with a confident, well-formatted output while getting the answer completely wrong, or misunderstand an instruction on step two and silently propagate that error across twenty downstream steps. If you only log the final response, you have no idea where things went sideways.

The silent quality problem is what we've watched burn teams at scale. We've analyzed 12 million production logs across customers, and around 78% of failures were not clean errors or timeouts. The user simply got a wrong or useless answer and left. As Respan puts it, without a separate fact-checking layer or traces showing retrieval failures, there's no signal that anything went wrong.

Sampling compounds all three. Most monitoring pipelines inspect a slice of traffic, 10% or 20%. With rare failure modes like a jailbreak variant or a tool-call edge case that only surfaces in step 5+ of a long chain, a 10% sample misses roughly nine out of ten actual failures; for a failure mode that hits 0.3% of sessions, the failures you do inspect amount to about 0.03% of total traffic. Full-coverage classification isn't a premium feature; it's the minimum for reliable regression detection.

The Three Signal Layers You Need to Monitor

Effective LLM monitoring requires three distinct signal layers: operational (latency, tokens, tool call rates), quality (hallucinations, goal completion, instruction-following), and safety (jailbreaks, prompt injection, data leakage). Most teams instrument layer one and skip two and three, which is exactly where the costly failures live.

Layer 1: Operational signals. Latency per step (not just per session), token count and cost per step, error rates, and tool call success/failure rates. The per-step granularity matters: if you only track session-level token cost, you can't identify which retrieval step or tool invocation is ballooning your bill.
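To make the per-step point concrete, here is a minimal sketch of aggregating token cost by step rather than by session. The record fields, step names, and the flat `PRICE_PER_1K_TOKENS` value are all assumptions made up for illustration; real records would come from your own trace store and pricing.

```python
from collections import defaultdict

# Hypothetical step records; in practice these would come from your trace store.
steps = [
    {"session": "s1", "step": "retrieval", "tokens": 1200, "latency_ms": 180},
    {"session": "s1", "step": "model_call", "tokens": 900, "latency_ms": 950},
    {"session": "s2", "step": "retrieval", "tokens": 5400, "latency_ms": 210},
    {"session": "s2", "step": "model_call", "tokens": 1000, "latency_ms": 1020},
]

PRICE_PER_1K_TOKENS = 0.002  # assumed flat price, for illustration only


def cost_by_step(records):
    """Sum token cost per step name across all sessions."""
    totals = defaultdict(float)
    for r in records:
        totals[r["step"]] += r["tokens"] / 1000 * PRICE_PER_1K_TOKENS
    return dict(totals)


per_step = cost_by_step(steps)
most_expensive = max(per_step, key=per_step.get)
print(per_step, most_expensive)
```

With only session-level totals, both sessions here would just look "somewhat expensive"; grouping by step is what shows that retrieval carries most of the tokens.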

Layer 2: Quality signals. Hallucination assertions, retrieval relevance scores, goal completion rates, and instruction-following fidelity. This is the layer most teams under-instrument. As Braintrust observes, traditional APM tools measure infrastructure health but cannot evaluate the quality of LLM output. Quality signals require automated evaluations running against every interaction, not a dashboard of p99 latency.

From the 12 million logs we've analyzed, hallucinations ranked as the top failure category, user frustration second, and agent forgetfulness third. Pure tool-call error tracking doesn't touch any of those three.

Layer 3: Safety and risk signals. Jailbreak detection, prompt injection, toxicity, and data leakage. The key point: safety signals need the same assertion pipeline as quality signals, not a separate manual review workflow. A jailbreak that returns HTTP 200 with a polished response is invisible unless you have a classifier looking for it.

What surprises most teams is the scope of what's classifiable. A finance company we worked with set up a classifier for mismatched GL codes. You'd think that's trackable with a simple start/end-state check, but agents can take dozens of different paths to a result, and even the end states have hundreds of variations. When a GL code isn't in the system, the agent might not output a code at all. That kind of structured monitoring across varied intermediate behavior requires a proper classification layer, not a regex.

How a Production LLM Monitoring Workflow Actually Runs

A production LLM monitoring workflow runs in five steps: instrument every intermediate step with spans, run automated quality and safety assertions on every interaction, gate prompt changes with statistically rigorous A/B testing, route alerts with code-level context rather than vague degradation notices, and debug by replaying from the exact failing step. Each step depends on the previous one.

Step 1: Instrument every intermediate step, not just the final output. Emit spans for prompt construction, retrieval, each tool invocation, model call, and output parsing. OpenTelemetry is the standard instrumentation layer here, providing a single standardized way to handle observability data. Distributed tracing connects these spans into a trace graph so you can identify which specific model call, tool invocation, or reasoning step caused a regression. Setup can be as lightweight as five lines of instrumentation code in your agent file.
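As a rough illustration of step-level spans, the sketch below uses only the Python standard library so it runs anywhere. The `span` helper, the `SPANS` list, and the `run_agent` steps are hypothetical stand-ins, not OpenTelemetry's API; in a real setup, OpenTelemetry's `tracer.start_as_current_span(...)` would play the role of `span` and an exporter would replace the in-memory list.

```python
import time
from contextlib import contextmanager

SPANS = []  # in-memory sink; a real setup would export to a tracing backend


@contextmanager
def span(trace_id, name, **attributes):
    """Record one step of an agent run with timing, attributes, and status."""
    record = {"trace_id": trace_id, "name": name,
              "attributes": dict(attributes), "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Mark the failing step so the trace shows where the run broke.
        record["status"] = "error"
        record["attributes"]["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


def run_agent(trace_id, question):
    with span(trace_id, "prompt_construction") as s:
        prompt = f"Answer briefly: {question}"
        s["attributes"]["prompt_chars"] = len(prompt)
    with span(trace_id, "retrieval", query=question) as s:
        docs = ["doc-1", "doc-2"]  # placeholder for a real retriever
        s["attributes"]["doc_count"] = len(docs)
    with span(trace_id, "model_call", model="example-model") as s:
        answer = "42"  # placeholder for a real model call
        s["attributes"]["output"] = answer
    return answer


run_agent("trace-001", "What is the answer?")
print([s["name"] for s in SPANS])
```

Each intermediate step gets its own record under a shared trace ID, which is the property the later steps (assertions, alerts, replay) depend on.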

Step 2: Run automated evaluation assertions on every interaction. Start with built-in classifiers for the failures you already know about: hallucinations, bad tool calls, goal abandonment, user frustration. Add custom classifiers for domain-specific failure modes as you learn what actually breaks in your application. The custom classifiers matter as much as the built-ins; production agents fail in ways you didn't anticipate when you wrote your eval suite.
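One way to keep built-in and custom checks on the same footing is a small registry that every interaction passes through. This is a sketch of the plumbing only: the `classifier` decorator, the interaction fields, and the two checks are hypothetical, and the checks are deliberately trivial rules. A real hallucination or frustration classifier would be a trained model behind the same interface.

```python
from typing import Callable, Dict, List

Interaction = Dict[str, object]
Classifier = Callable[[Interaction], bool]  # True means "failure detected"

CLASSIFIERS: Dict[str, Classifier] = {}


def classifier(name: str):
    """Register a check under a name so built-in and custom checks run the same way."""
    def register(fn: Classifier) -> Classifier:
        CLASSIFIERS[name] = fn
        return fn
    return register


@classifier("bad_tool_call")
def bad_tool_call(interaction):
    # Placeholder rule: any tool call that reported an error status.
    return any(call.get("status") == "error"
               for call in interaction.get("tool_calls", []))


@classifier("empty_answer")
def empty_answer(interaction):
    # Placeholder rule: the final output is missing or whitespace only.
    return not str(interaction.get("output", "")).strip()


def evaluate(interaction: Interaction) -> List[str]:
    """Run every registered classifier and return the names that fired."""
    return [name for name, check in CLASSIFIERS.items() if check(interaction)]


interaction = {
    "output": "   ",
    "tool_calls": [{"name": "lookup_order", "status": "error"}],
}
fired = evaluate(interaction)
print(fired)
```

Adding a domain-specific check then means registering one more function, without touching the loop that runs over every interaction.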

Step 3: Gate prompt changes with A/B testing. Before rolling out a prompt change to full traffic, test it against live production sessions with statistical rigor. We've covered the mechanics of this in detail in our prompt A/B testing article. The short version: manual A/B tests and clean eval runs don't tell you what actually improves agent performance in production. Real traffic does.
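"Statistical rigor" can mean many things; one common choice for comparing success rates between two prompt variants is a two-proportion z-test. The sketch below assumes that test and uses invented counts, so treat it as an example of the kind of check involved rather than a description of any particular platform's method.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference between two success rates."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: sessions where the goal was completed, per prompt variant.
z, p_value = two_proportion_z_test(success_a=820, total_a=1000,
                                   success_b=860, total_b=1000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be noise; it does not by itself say the new prompt is safe on every failure category, so the per-classifier rates still need checking.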

Step 4: Route alerts with code-level context. An alert that says "quality degraded" is nearly useless at 2 AM. An alert that says "classifier: hallucination, trace ID: XYZ, step 4 of 7, source: agent.py line 112, suggested fix" is actionable immediately. The difference is whether your alerting layer carries the full trace context or just a metric threshold crossing. At Sentrial, alerts route to Slack with source-code-level failure pinpointing and fix suggestions, so the on-call engineer can act without digging through logs manually.
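A sketch of what carrying the trace context through to the alert might look like as a data structure. The `Alert` class and its field names are hypothetical and mirror the example alert text above; they are not Sentrial's actual payload format, and the suggested-fix text is invented. Posting the message to Slack or PagerDuty is left out.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    classifier: str
    trace_id: str
    step_index: int
    step_count: int
    source: str
    suggested_fix: str = ""

    def to_message(self) -> str:
        """Render the alert as plain text suitable for a chat or paging tool."""
        lines = [
            f"classifier: {self.classifier}",
            f"trace ID: {self.trace_id}",
            f"step {self.step_index} of {self.step_count}",
            f"source: {self.source}",
        ]
        if self.suggested_fix:
            lines.append(f"suggested fix: {self.suggested_fix}")
        return "\n".join(lines)


alert = Alert(
    classifier="hallucination",
    trace_id="XYZ",
    step_index=4,
    step_count=7,
    source="agent.py line 112",
    suggested_fix="ground the answer in retrieved documents",  # illustrative text
)
message = alert.to_message()
print(message)
```

The point of the structure is that every field is filled from the trace at the moment the classifier fires, so the on-call engineer never starts from a bare threshold crossing.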

Step 5: Debug by replaying or forking from the failing step. Re-running the entire session to reproduce a failure is slow and often non-reproducible given non-determinism. Replaying from the exact intermediate step that broke, or forking from that step to test a fix, cuts debugging time dramatically. Unified traces are what make this possible: they give you the exact state at each step, so you're not reconstructing from logs.
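The replay-and-fork idea can be sketched with steps modeled as functions over a state dictionary. Everything here is a simplifying assumption: the step functions are toy placeholders, `record_run` and `run_from` are hypothetical helpers, and a real agent would also need to capture model and tool responses to make replay faithful.

```python
import copy


def record_run(steps, initial_state):
    """Run all steps, saving a snapshot of the state as it was before each step."""
    states = []
    state = initial_state
    for step in steps:
        states.append(copy.deepcopy(state))
        state = step(state)
    return states, state


def run_from(steps, recorded_states, start_index):
    """Re-run steps[start_index:] starting from the state recorded before that step."""
    state = copy.deepcopy(recorded_states[start_index])
    for step in steps[start_index:]:
        state = step(state)
    return state


def retrieve(state):
    return {**state, "docs": ["policy.txt"]}


def buggy_summarize(state):
    return {**state, "summary": ""}  # simulated failure: empty summary


def fixed_summarize(state):
    return {**state, "summary": f"Summary of {len(state['docs'])} document(s)"}


def respond(state):
    return {**state, "answer": state["summary"] or "I don't know"}


original = [retrieve, buggy_summarize, respond]
states, final = record_run(original, {"question": "What is the refund policy?"})

# Fork at step index 1 with a candidate fix, reusing the recorded retrieval output.
forked = run_from([retrieve, fixed_summarize, respond], states, start_index=1)
print(final["answer"], "|", forked["answer"])
```

Because the fork starts from the recorded state, the retrieval step is not re-executed, which sidesteps both its cost and its non-determinism while testing the fix.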

Full log coverage is what makes the entire assertion pipeline trustworthy. A sampled eval can miss the tool-call failure that surfaces in 0.3% of sessions. Classifying every interaction means the detection math actually works.

Common Misconceptions That Burn Teams in Production

Three beliefs cause the most production pain, and none of them are obvious until you've already paid for them.

"If evals pass in CI, the model is fine in production." Eval datasets go stale. Production drift (new user intents, updated tools, prompt version changes) creates failure modes your eval set never anticipated. As VentureBeat notes, evaluation datasets suffer from concept drift as user behavior evolves, making static datasets increasingly outdated. Passing evals last Tuesday and hallucinating today are not contradictory; they're the norm. We've covered this in depth in our article on what evals actually miss.

"Monitoring a 10% sample is good enough." This is the non-obvious one, and it's the one that bites teams hardest. Rare but severe failure modes appear at frequencies below typical sampling rates. A 10% sample with a 0.3% failure rate means the failures you actually inspect amount to about 0.03% of total traffic, and roughly nine out of ten real failures are never looked at. This isn't a statistical edge case; it's the failure mode that only surfaces when the agent reaches step 5+ of a long chain, or when a specific user phrasing triggers a jailbreak variant. Full-coverage classification isn't a premium upgrade; it's what makes monitoring reliable at all. Model providers like OpenAI and Anthropic also modify model behavior without notice, and a sampled pipeline may never surface that drift until users start complaining.
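The sampling arithmetic is easy to check directly. The sketch below assumes each session is sampled independently and fails independently at a fixed rate, which is a simplification; the function names and the 1,000-session volume are illustrative.

```python
def sampled_failure_share(sample_rate, failure_rate):
    """Fraction of all sessions that are both sampled and failing."""
    return sample_rate * failure_rate


def prob_detect_at_least_one(sample_rate, failure_rate, sessions):
    """Chance that the inspected sessions include at least one failure."""
    return 1 - (1 - sample_rate * failure_rate) ** sessions


share = sampled_failure_share(sample_rate=0.10, failure_rate=0.003)
sampled = prob_detect_at_least_one(0.10, 0.003, sessions=1000)
full = prob_detect_at_least_one(1.00, 0.003, sessions=1000)

print(f"share of traffic that is an inspected failure: {share:.4%}")
print(f"P(see at least one failure) at 10% sampling: {sampled:.2f}")
print(f"P(see at least one failure) at full coverage: {full:.2f}")
```

Under these assumptions, over 1,000 sessions a 10% sample is more likely than not to contain zero examples of the failure, while full coverage almost certainly contains several.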

"Latency and error rate cover the risk." An LLM application can have perfect operational metrics while confidently hallucinating on every third response. Across our 12 million analyzed logs, 78% of failures were silent quality failures, not errors or timeouts. Quality signals are not derivable from operational telemetry. They require a separate evaluation layer running against every interaction.

At Sentrial, our approach to all three is the same: classify every interaction (not a sample), run built-in and custom classifiers across the full log volume without LLM-per-call costs, and maintain session-level traces that make replay from any intermediate step possible. The classifiers deploy in under a minute from three or four example logs, so you're not waiting on a training pipeline every time a new failure pattern surfaces.

How to Get Started with LLM Monitoring in 2026

Start with instrumentation, layer in evaluations, wire up alerting, and close the loop with A/B testing and replay. Don't try to do all four at once. Each phase delivers value independently, and the later phases only work if the earlier ones are solid.

Phase 1: Instrument for full step-level visibility. Instrument with OpenTelemetry or a native SDK. Capture inputs, outputs, latency, and token cost at every step, not just the session level. If you don't have spans for each tool invocation and model call, phases 2 through 4 are operating blind.

Phase 2: Add automated quality assertions. Start with built-in classifiers for the most common failures: hallucinations, tool call errors, goal abandonment. Add custom classifiers for your domain-specific failure modes. The custom classifier workflow matters here; you want to be able to deploy a classifier from a few example logs, not retrain a model. At Sentrial, teams can instantiate a custom classifier in under a minute by checking three or four example logs. The classifier deploys as a lightweight model, not an LLM-as-judge call, which means you can run it across millions of logs without a matching Anthropic bill.

Phase 3: Wire alerts to your incident workflow. Slack or PagerDuty with enough context that the on-call engineer can act without digging through logs. The alert needs the trace, the failing step, the classifier evidence, and ideally a suggested fix. A vague quality degradation notice is not an actionable alert.

Phase 4: Close the loop. Use production monitoring data to update your eval dataset, validate prompt changes with statistically rigorous A/B testing on live traffic, and replay or fork from failing sessions during debugging. At this point, monitoring is your fastest feedback loop, not a cost center.
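Feeding production failures back into the eval dataset can be as simple as a deduplicating merge. The sketch below assumes interactions are dictionaries with `input` and `classifier` fields and that inputs are JSON-serializable; `case_key` and `merge_failures` are hypothetical helpers. It stores the failure label rather than the production output, since that output was the wrong answer and the expected answer still needs a human.

```python
import hashlib
import json


def case_key(case):
    """Stable key for an eval case based on its input, so repeats are stored once."""
    payload = json.dumps(case["input"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def merge_failures(eval_set, flagged_interactions):
    """Add flagged production interactions to the eval set, skipping known inputs."""
    seen = {case_key(c) for c in eval_set}
    added = 0
    for item in flagged_interactions:
        case = {"input": item["input"], "failure": item["classifier"]}
        key = case_key(case)
        if key not in seen:
            eval_set.append(case)
            seen.add(key)
            added += 1
    return added


eval_set = [{"input": {"question": "Reset my password"}, "failure": "hallucination"}]
flagged = [
    {"input": {"question": "Reset my password"}, "classifier": "hallucination"},
    {"input": {"question": "Cancel order 17"}, "classifier": "bad_tool_call"},
    {"input": {"question": "Cancel order 17"}, "classifier": "bad_tool_call"},
]
added = merge_failures(eval_set, flagged)
print(added, len(eval_set))
```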

Sentrial covers all four phases in a single platform, with integrations for OpenTelemetry, LangChain, LangGraph, and custom Python agents. If you're in the process of comparing options, our LLM observability platform review covers the tools worth considering in 2026 and what most of them miss.

FAQ

What is the difference between LLM monitoring and LLM observability?

Observability is the property of a system that lets you understand its internal state from its external outputs. Monitoring is the active practice of watching, evaluating, and alerting on that state. In LLM terms: observability gives you the traces and telemetry; monitoring is what you do with them. Our LLM observability article covers the four architectural layers in detail.

What metrics should you monitor for LLM applications?

Three layers: operational (latency per step, token cost per step, tool call success rate, error rate), quality (hallucination rate, goal completion, instruction-following fidelity, retrieval relevance), and safety (jailbreak detection, prompt injection, toxicity, data leakage). Most teams instrument layer one and skip two and three, which is exactly where the costly silent failures live.

How do you monitor hallucinations or answer quality in production?

With automated assertion classifiers running against every interaction. Human review doesn't scale to production log volumes, and LLM-as-judge pipelines lose accuracy as agents grow more complex and run for longer chains. The practical approach is fine-tuned classifiers that run on every log, not a sample, with alerts routed when the hallucination rate crosses a threshold. From our 12 million logs analyzed, hallucinations are the single largest failure category, and almost none of them produce errors that traditional monitoring would catch.
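A threshold on the hallucination rate is usually computed over a recent window rather than all time. Here is a minimal sketch, assuming a classifier has already produced a true/false flag per interaction; the `RateAlert` class, the window size, and the threshold are illustrative choices, not recommended values.

```python
from collections import deque


class RateAlert:
    """Fire when the flagged rate over the last `window` interactions exceeds `threshold`."""

    def __init__(self, window, threshold):
        self.flags = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, flagged):
        """Record one classifier result; return True if an alert should fire."""
        self.flags.append(bool(flagged))
        full = len(self.flags) == self.flags.maxlen
        rate = sum(self.flags) / len(self.flags)
        # Wait for a full window so a single early flag does not trigger an alert.
        return full and rate > self.threshold


monitor = RateAlert(window=5, threshold=0.2)
results = [monitor.observe(f) for f in [False, False, True, False, True]]
print(results)
```

Waiting for a full window trades a short warm-up delay for fewer false alarms; a production version would likely also debounce repeated alerts.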

How can OpenTelemetry be used for LLM monitoring?

OpenTelemetry provides the instrumentation layer: spans, events, and attributes that capture what happened at each step in an agent run. It gives you the traces. What it doesn't include is evaluation (was this output correct?) or alerting (what should fire when it isn't?). For LLM monitoring, OTel is the foundation, and you need a layer above it for quality assertions, classifiers, and alert routing. Most production setups use OTel for instrumentation and a purpose-built LLM observability platform for everything above that.

What should an LLM monitoring stack include end-to-end?

Five components: step-level tracing (inputs, outputs, latency, token cost at every intermediate step), automated quality and safety classifiers running on full log coverage, prompt A/B testing with statistical rigor for validating changes before rollout, real-time alerting with trace context and code-level failure pinpointing, and replay or fork capability for debugging from the exact failing step. Most tools cover one or two of these. The gap that burns teams is usually the missing integration between evaluation results and the debugging workflow.

Try Sentrial

Catch behavioral AI agent failures traditional APM tools miss.

Get started
