LLM Observability Explained: Silent Failures & What to Track

Most engineering teams find out their AI agent is broken when a user complains about it. We see this constantly. By then the damage is done: days or weeks of bad outputs may already have gone out. If you're relying on your APM tool to catch these failures, it won't, and we'll show you exactly why.

Traditional Monitoring Has No Idea Your Agent Is Failing

LLM observability is continuous, end-to-end visibility into an AI model or agent's behavior across every interaction. It covers inputs and outputs, execution traces across model, tool, and retrieval steps, operational metrics like latency, token usage, and cost, and quality signals like hallucinations, relevance, and user frustration. It goes beyond uptime to answer whether the agent is doing the right thing.

That last part is the whole point. Classical observability tools (logs, metrics, traces) were designed to answer basic health questions: is the service up, and how fast is it? They're excellent at detecting infrastructure failures. They cannot detect an LLM that's technically responding but producing low-quality, harmful, or hallucinated outputs. A 200 OK with a confidently wrong answer is completely invisible to Datadog or any infrastructure monitor. You need to internalize that distinction before any of the rest of this makes sense.

The instrumentation side of LLM observability is getting more standardized. OpenTelemetry's GenAI Semantic Conventions provide a vendor-neutral schema for structuring spans, metrics, and events across any GenAI system. But those standards handle the infrastructure layer; they don't classify quality signals. That's a different and harder problem.

78% of Agent Failures Leave No Trace at All

The number that matters: across 12 million agent logs we analyzed at Sentrial, approximately 22% of issues were tool call failures, where the agent ran into an error that made it stop. The remaining 78% were silent failures. No crash. No error code. No latency spike. The user just got a wrong or useless answer and left.

As LangChain has noted, an agent can have 99% uptime but still fail to follow user intent. Your APM dashboard shows healthy latency percentiles and low error rates while users report the agent confidently gave them wrong information. The gap is structural: APM tools instrument the execution layer (did the request succeed, how fast did it run) but have no semantic layer to answer whether the response was accurate or whether the user got what they needed. LLM agents fail in meaning, not in mechanism.

The three silent failure categories we see most often in production:

Hallucinations. The agent produces a confident, fluent answer that's factually wrong or contradicts the information it was given. Research confirms that LLM systems can produce syntactically correct output that is factually wrong, a failure mode with no analogue in conventional software.

User frustration. The user repeats their intent across multiple turns, rephrases the same question, or abandons the session without completing their task. No error fires. The session logs show a series of successful requests.

Agent forgetfulness. In multi-step workflows, the agent ignores or incorrectly references context from earlier in the conversation, producing responses that are locally coherent but globally wrong.

Without signals for these failure categories, teams discover problems through user complaints or manual audits, days or weeks after the fact. Mean time to detect and fix quality incidents is orders of magnitude longer when you're relying on crash signals alone.

The Four Layers You Actually Need to Instrument

Most observability guides list metric families. What they rarely explain is the instrumentation architecture, which layers to build, in what order, and what breaks if you skip one.

Layer 1: I/O. Capture prompts, completions, session IDs, and model and version metadata at every interaction. Without session IDs, you can't tell whether a user abandoned after turn 3 because the agent forgot context; you just see three successful 200s.

Layer 2: Execution trace. Capture spans across every planner call, executor step, tool call, and retrieval step, with parent-child relationships that preserve the causal structure of the run. As we've learned building Sentrial, when tool calls and branching behavior vary across runs, trace-and-end-state-only approaches stop being sufficient. You need the full path, not just the outcome.

Layer 3: Operational metrics. Latency per span, token usage, cost, throughput, and error rates. This is what traditional APM covers. It's necessary but not enough.

Layer 4: Quality and behavior signals. Hallucination signals, relevance and faithfulness scores, safety and PII flags, and user outcome signals. Most teams add this layer last and should plan for it earliest. It also requires classification, not just instrumentation: you can't write a threshold rule that catches "the answer was plausible but wrong."

For the minimum viable trace schema, every span should carry at least: session_id, run_id, step_type, input, output, latency_ms, model_version, and a failure_mode label field. OpenTelemetry's GenAI Semantic Conventions establish a standard vocabulary for the first three layers. The fourth requires a separate classification system built on top of your traces.
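As a rough illustration of that minimum schema, here is a sketch of one span as a Python dataclass. The eight required field names come from the list above. Everything else is an assumption added for illustration: the span_id and parent_span_id fields, the set of allowed step_type values, and the to_log_line helper. None of these are OpenTelemetry attribute names.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Assumed step types, based on the steps this article mentions.
STEP_TYPES = {"planner", "executor", "tool", "retrieval", "model"}


@dataclass
class AgentSpan:
    # Fields from the minimum viable schema.
    session_id: str
    run_id: str
    step_type: str
    input: str
    output: str
    latency_ms: float
    model_version: str
    failure_mode: Optional[str] = None  # left empty until the quality layer labels it
    # Hypothetical additions so parent-child structure can be stored.
    span_id: Optional[str] = None
    parent_span_id: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject spans that would be unqueryable later.
        if self.step_type not in STEP_TYPES:
            raise ValueError(f"unknown step_type: {self.step_type!r}")
        if self.latency_ms < 0:
            raise ValueError("latency_ms must be non-negative")


def to_log_line(span: AgentSpan) -> str:
    """Serialize a span as one JSON log line."""
    return json.dumps(asdict(span), sort_keys=True)
```

Keeping failure_mode on the span itself, even while it is empty, means the quality layer can label existing traces later without a schema migration.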

Each Silent Failure Mode Has a Distinct Signal Pattern

Treating all silent failures as one undifferentiated category leads teams to build monitoring that's too generic to catch anything specific. These failure modes look different in the data, and you have to instrument for each one separately.

Hallucinations appear when output contradicts retrieved context or known facts. The trace looks clean (retrieval ran, the model responded), but the output diverges from the source material. A case we worked through at Sentrial illustrates how bad this can get: a Series B finance startup launched an agent in week two to handle vendor quotes. The agent was giving different quotes based on different PDFs, and the prices seemed approximately right. What was actually happening: the agent wasn't ingesting the PDF properly, wasn't extracting the data, and was hallucinating the quote price based on other context like the RFP and customer metadata. Infrastructure monitoring showed nothing unusual. The company said this wouldn't have been caught "for a century" without semantic-level monitoring across every step of that workflow.
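To make "output diverges from the source material" concrete, here is a deliberately crude sketch: flag any number in the agent's answer that does not appear anywhere in the retrieved context. This is a rule-based stand-in for illustration only, not the classifier approach described later in this article, and the function names and example strings are hypothetical. It would catch a fabricated quote price but not a hallucination expressed in words.

```python
import re

# Matches integers and decimals, with optional thousands separators.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set:
    """Return the numeric values found in text, normalized to floats."""
    return {float(m.group().replace(",", "")) for m in _NUMBER.finditer(text)}


def ungrounded_numbers(output: str, retrieved_context: str) -> set:
    """Numbers the agent stated that never appear in the retrieved context."""
    return extract_numbers(output) - extract_numbers(retrieved_context)


if __name__ == "__main__":
    context = "Vendor quote: unit price $1,250.00 for 40 units."
    answer = "The quoted price is $1,310 per unit for 40 units."
    # 1310 is not in the source document, so it gets flagged.
    print(ungrounded_numbers(answer, context))
```

A check like this only works if the retrieved context is captured on the span in the first place, which is one reason the I/O and trace layers come before the quality layer.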

Frustration appears when a user repeats the same intent two or more turns in a row and the session ends without task completion. No individual request fails; the failure is only visible at the session level across multiple turns.
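The frustration pattern above can be sketched as a session-level check. This is a minimal illustration under stated assumptions: "same intent" is approximated with word-overlap (Jaccard) similarity, the 0.5 threshold is arbitrary, and task_completed is assumed to be a flag your application already records. A production system would need a real intent classifier.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased word tokens, used as a crude proxy for intent."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two messages."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def longest_repeat_run(user_turns: list, threshold: float = 0.5) -> int:
    """Length of the longest run of consecutive user turns that look alike."""
    if not user_turns:
        return 0
    longest = current = 1
    for prev, cur in zip(user_turns, user_turns[1:]):
        current = current + 1 if similarity(prev, cur) >= threshold else 1
        longest = max(longest, current)
    return longest


def is_frustrated_session(user_turns: list, task_completed: bool,
                          min_run: int = 2, threshold: float = 0.5) -> bool:
    """True when the user repeated themselves and the task never completed."""
    return (not task_completed) and longest_repeat_run(user_turns, threshold) >= min_run
```

Note that the function takes the whole list of user turns. Nothing in a single request would trip it, which is why this signal needs session IDs on every span.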

Forgetfulness appears when an agent references prior context incorrectly or ignores it entirely in multi-turn runs. Individual spans look fine. The failure only surfaces when you look at the relationship between turns.

Two less-obvious failure modes deserve attention because they're the ones most teams discover last:

Prompt injection and hijacked tool use. An agent follows malicious instructions embedded in retrieved content, completes its run successfully by every operational metric, and does exactly the wrong thing. Prompt injection is an observability problem: if you can't see how instructions entered the system, how they competed for influence, and where the model's behavior suddenly changed, you're debugging shadows.

Cascade degradation. A single weak retrieval step poisons downstream reasoning. No individual span shows an error. The failure is only detectable when you compare the retrieved context to the final output and notice the reasoning chain was built on bad inputs.

All of these require classification, not thresholding. This is where the semantic layer of LLM observability diverges architecturally from APM, and why the monitoring architecture matters as much as the metrics you track.

Sampling Is Exactly Why You're Missing the Failures That Matter

Here's the non-obvious misconception: teams believe sampling is sufficient for quality monitoring because it works well for infrastructure metrics. For agent quality, sampling isn't just imprecise; it's the source of the blind spot itself.

A hallucination pattern that affects 0.3% of sessions but targets your highest-value users will rarely show up in a 5% or 10% sample often enough to be recognized as a pattern. Missing it isn't bad luck. It's the predictable, structural consequence of sampling a semantic signal that appears at low frequency in a long tail. In a web app, p99 latency is a reasonable proxy for user experience across the distribution. In an AI agent, p99 latency can look perfect while users leave worse off than before. Sampling a metric that doesn't capture quality is sampling the wrong thing entirely.
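To put rough numbers on that, here is a back-of-the-envelope calculation. The 0.3% failure rate and the 5% and 10% sample rates come from the paragraph above. The volume of 10,000 sessions and the assumption of independent uniform random sampling are ours, chosen only for illustration.

```python
def expected_sampled_failures(total_sessions: int, failure_rate: float,
                              sample_rate: float) -> float:
    """Expected number of failing sessions that land in the sample."""
    return total_sessions * failure_rate * sample_rate


def prob_sample_misses_all(total_sessions: int, failure_rate: float,
                           sample_rate: float) -> float:
    """Chance that an independent random sample contains no failing session."""
    failing = round(total_sessions * failure_rate)
    return (1 - sample_rate) ** failing


if __name__ == "__main__":
    for rate in (0.05, 0.10):
        expected = expected_sampled_failures(10_000, 0.003, rate)
        miss = prob_sample_misses_all(10_000, 0.003, rate)
        print(f"sample={rate:.0%} expected_hits={expected:.1f} p_miss_all={miss:.3f}")
```

Under these assumptions, 30 of the 10,000 sessions are affected. A 5% sample is expected to contain about 1.5 of them and has roughly a one-in-five chance of containing none, and a 10% sample is expected to contain about 3. One to three scattered examples are easy to dismiss as noise, which is the practical problem.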

The second misconception worth naming: LLM observability is just "adding evals." Evals are a subset: offline or CI-gated quality checks you run before deployment. Observability is the operational, production-time equivalent: continuous classification of live traffic, catching the failures that only appear when real users interact with your agent in ways you didn't anticipate. The best programs connect both, but they're not the same thing. Passing evals before launch tells you the agent can work. Observability tells you whether it is working.

This is the gap that shaped how we built Sentrial. Instead of sampling or LLM-as-judge systems, which are themselves subject to hallucination and become less accurate as agent complexity grows, we use post-trained classifiers fine-tuned on each customer's own agent traffic, running against every log. Not a sample. The classifiers are trained to the patterns of that specific customer's agent behavior, which is what makes them accurate enough to surface the 0.3% failures before they become business problems. If you want to dig into how the sampling gap plays out across specific tools, we compared the approaches in our LLM observability platforms roundup.

What Debugging Actually Looks Like When You Have This Data

Here's what a concrete incident looks like with and without proper LLM observability in place.

Scenario: a customer support agent starts showing a 12% week-over-week increase in session abandonment.

Without LLM observability: an engineering escalation kicks off. Someone pulls raw logs. There's no session grouping, so correlating turn 3 abandonment to a specific agent behavior requires manual reconstruction. It's unclear whether the problem is retrieval, model, prompt, or a tool call. Resolution time: days to weeks.

With LLM observability: a frustration signal spikes on sessions involving a specific tool call. Filtering to those sessions reveals a retrieval step returning stale context. The engineer replays from that retrieval step, forks the trace with a corrected retrieval configuration, and tests the fix without re-running the full agent. Full request tracing makes it possible to move from a detected issue to a clear explanation: the engineer can inspect a specific trace and see exactly what the model received, what it returned, and how long each step took. Resolution time: hours.
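The first move in that workflow, seeing which tool call the frustration signal clusters around, is a simple aggregation once sessions carry both tool names and a frustration label. The sketch below assumes each session is a dict with a "tools" list and a "frustrated" boolean. Those field names and the example tool names in the comments are hypothetical.

```python
from collections import defaultdict


def frustration_rate_by_tool(sessions: list) -> dict:
    """Share of sessions flagged as frustrated, per tool used in the session."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for session in sessions:
        # Count each tool once per session, however often it was called.
        for tool in set(session["tools"]):
            totals[tool] += 1
            if session["frustrated"]:
                flagged[tool] += 1
    return {tool: flagged[tool] / totals[tool] for tool in totals}


def worst_tool(sessions: list) -> str:
    """Tool with the highest frustration rate (ties broken arbitrarily)."""
    rates = frustration_rate_by_tool(sessions)
    return max(rates, key=rates.get)


if __name__ == "__main__":
    sessions = [
        {"tools": ["order_lookup"], "frustrated": True},   # hypothetical tool names
        {"tools": ["order_lookup", "faq_search"], "frustrated": True},
        {"tools": ["faq_search"], "frustrated": False},
        {"tools": ["faq_search"], "frustrated": False},
    ]
    print(frustration_rate_by_tool(sessions))
```

From there, filtering to the sessions that used the worst tool narrows the set of traces an engineer has to read.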

One of our Fortune 1000 customers runs custom Python and LangChain agents across supply chain, HR, and marketing workflows. After instrumenting with full session-level tracing and automated quality classification, their error rate dropped from 20% to under 10% in a single week, and they could see exactly which failure categories drove each point of improvement.

Where to Start and What Order to Do It In

Step 1: Instrument your agent boundaries. Capture session start and end, and wrap each planner call, executor step, tool call, and retrieval step as a span with parent-child relationships. OpenTelemetry is the standard here. If you're running LangChain or LangGraph, native instrumentation handles most of this automatically.

Step 2: Enrich every span. Add session_id, run_id, model_version, and step_type to every span from the start. This metadata is what makes traces queryable at session level rather than just request level.

Step 3: Route to a session-aware backend. A backend that stores and queries traces at request level only gives you half the picture. You need session-level aggregation to detect frustration, forgetfulness, and multi-turn failures.

Step 4: Add quality classification earlier than you think you need to. Your options here involve real tradeoffs. Rule-based heuristics like output length and refusal patterns are fast and cheap but miss nuance. LLM-as-judge approaches are flexible but expensive and subject to hallucination, which compounds as agent complexity grows. Fine-tuned classifiers trained on your own traffic are more accurate but require a platform to maintain them. Plan for which approach you'll use before you hit scale, not after.
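For a sense of what the cheapest of those options looks like, here is a minimal rule-based sketch covering output length and refusal patterns. The specific patterns, thresholds, and flag names are assumptions chosen for illustration and would need tuning against your own traffic.

```python
import re

# Hypothetical refusal phrasings; extend with whatever your agent actually says.
REFUSAL_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"\bI(?:'m| am) (?:sorry|unable)\b",
        r"\bI can(?:'t|not) (?:help|assist)\b",
        r"\bas an AI\b",
    )
]


def heuristic_flags(output: str, min_chars: int = 20, max_chars: int = 4000) -> list:
    """Return cheap rule-based quality flags for one agent response."""
    text = output.strip()
    flags = []
    if len(text) < min_chars:
        flags.append("too_short")
    if len(text) > max_chars:
        flags.append("too_long")
    if any(pattern.search(text) for pattern in REFUSAL_PATTERNS):
        flags.append("refusal")
    return flags
```

The limitation is visible immediately: a fluent, well-sized answer containing a wrong price returns no flags at all. That is the nuance these rules miss, and the reason to decide early how the harder cases will be classified.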

On the decision: if your agent handles fewer than 1,000 sessions per month, start with OpenTelemetry instrumentation and manual review. Tools like Langfuse are a solid starting point for session-level tracing at that scale; we covered the tradeoffs directly in our Sentrial vs Langfuse comparison. If you're at scale with real users and real business risk, the sampling gap becomes your biggest observability liability. That's the threshold where Sentrial is worth evaluating: five lines of instrumentation via OpenTelemetry, LangChain, LangGraph, or custom Python, with full-log classification running against every session from day one.

AI agents degrade over time through data drift, concept drift, model drift, and tool and API changes. The teams that catch it earliest are the ones who built the quality layer into their observability stack before they needed it.

Frequently Asked Questions

What is meant by LLM observability?

LLM observability is continuous, end-to-end visibility into an AI model or agent's behavior across every interaction. It covers inputs and outputs, execution traces across model, tool, and retrieval steps, operational metrics like latency and cost, and quality signals like hallucinations, user frustration, and goal abandonment. It answers not just whether your agent is running but whether it's doing the right thing. Traditional monitoring answers "is it up"; LLM observability answers "is it working."

How is LLM observability different from traditional application monitoring?

Traditional APM detects infrastructure failures: crashes, timeouts, latency spikes. LLM observability detects semantic failures: wrong answers, hallucinations, users abandoning sessions without completing their goal. The gap is structural: APM instruments the execution layer and has no mechanism for evaluating whether a response was accurate or helpful. Across 12 million agent logs we analyzed, approximately 78% of failures were silent, with no error code and no latency spike, and completely invisible to traditional monitoring tools.

What are the most important metrics to track for LLM observability?

At the operational layer: latency per span, token usage, cost, and error rates. At the quality layer: hallucination rate, task completion rate, user frustration signals like repeated rephrasing and session abandonment, and tool call failure rate. The operational metrics tell you the agent is running. The quality metrics tell you whether it's working. Most teams instrument the first category well and underinvest in the second until a production incident forces the issue.

What are the 5 pillars of LLM observability?

The most complete framework covers five areas: (1) tracing: capturing inputs, outputs, and execution paths across every model, tool, and retrieval step; (2) operational metrics: latency, token usage, cost, and throughput; (3) quality evaluation: hallucination detection, relevance, faithfulness, and safety signals; (4) alerting: real-time notification when quality or operational signals cross defined thresholds; and (5) debugging: the ability to pinpoint the specific step in a multi-step agent run where a failure originated and replay from that point. Most tools cover one or two of these well. Full coverage requires all five.

Do I need LLM observability if I'm already using Datadog or another APM tool?

Yes. Traditional APM tools are adding LLM tabs that track tokens and latency alongside infrastructure metrics, but none of them answer the question that actually matters for AI agents: is your AI producing good outputs? Datadog catches latency spikes and 500 errors. It won't alert you when your agent hallucinates a vendor quote, loses context in a multi-turn workflow, or drives users to abandon sessions. For AI agents in production, your APM tool is a necessary baseline, not a sufficient safety net. For a direct comparison, see our article on Datadog alternatives for teams whose agents fail silently.
