LLM Observability Has Four Layers. Most Teams Only Build One.

LLM observability catches what APM misses: hallucinations, user frustration, and the 78% of agent failures that never throw an error.

Neel Sharma

May 25, 2026 · 10 min read

What is LLM observability?

LLM observability is continuous, end-to-end visibility into an AI model or agent's behavior across every interaction. It covers inputs and outputs, execution traces across model, tool, and retrieval steps, operational metrics like latency and token cost, and quality signals like hallucinations, relevance, and user frustration. Unlike traditional monitoring, it answers not just "is the system up?" but "is the agent doing the right thing?"

That last question is the one your APM dashboard cannot answer.

As LangChain notes, an agent can have 99% uptime but still fail to follow user intent. Your dashboard shows healthy latency percentiles and low error rates while users report that the agent confidently gave them wrong information. That gap, between operational health and behavioral quality, is what LLM observability exists to close.

Why Traditional Monitoring Misses Most AI Failures

Traditional APM tools catch crashes, timeouts, and latency spikes. For AI agents, those represent a small minority of what actually goes wrong.

We've analyzed 12 million logs across production agent deployments, and around 22% of issues were explicit tool call failures, the kind that make an agent stop. The remaining 78% were silent: hallucinations, user frustration, and agent forgetfulness. The user just gets a wrong or useless answer and leaves. No alert fires.

This isn't a tooling gap that Datadog will eventually fill with a new tab. It's a structural one. Classical observability tools were designed to answer binary health questions ("Is the service up?" and "How fast is it?") and excel at detecting infrastructure failures. They cannot detect an LLM that's technically responding but producing hallucinated outputs. The execution layer succeeded. The semantic layer failed.

The three failure modes that fall through the APM gap are:

  • Hallucinations: confident wrong answers, often contradicting retrieved context or known facts
  • User frustration: repeated rephrasing, session abandonment, rage-quits after the agent fails to understand intent
  • Forgetfulness: the agent ignores or misremembers prior context across turns in multi-step workflows

When teams rely on crash signals alone, they find out about these failures through user complaints or manual audits, sometimes weeks after the regression started. Mean time to detect a quality incident is orders of magnitude longer than mean time to detect a 500 error.

The Four Layers You Actually Need to Instrument

Most observability guides list metric families. This is what the instrumentation architecture actually looks like in practice.

Layer 1: I/O layer. Capture prompts, completions, session IDs, and model/version metadata on every call. Without session IDs, you can't group turns into a conversation. Three successful 200s look fine; the session where a user rephrased the same question three times and then quit is invisible.

Layer 2: Execution trace layer. Spans across every planner call, executor step, tool invocation, and retrieval operation, with parent/child relationships intact. This is where agent non-determinism lives. When tool calls and branching behavior vary across runs, "trace and end-state only" approaches stop being sufficient. You need the intermediate steps because that's where failures compound.

Layer 3: Operational metrics layer. Latency per span, token usage, cost per run, throughput, and error rates. This is what OpenTelemetry and OpenInference handle well. The OpenTelemetry GenAI Semantic Conventions define a standard schema for tracking prompts, model responses, token usage, tool calls, and provider metadata. Check the stability status of the attributes you depend on, since the conventions are still evolving and attribute names can change between releases. Use them anyway. They give you a vendor-neutral foundation.

Layer 4: Quality/behavior layer. Hallucination signals, relevance and faithfulness scores, safety and PII flags, user outcome signals. This is the layer most teams add last and should plan for first. OpenTelemetry handles the infrastructure side. It does not classify quality signals. That is a different problem entirely.

A minimum viable trace schema looks like this: session_id, run_id, step_type, input, output, latency_ms, model_version, and a failure_mode label field. The label field is the one most teams omit, and it's the only field that lets you query "show me all sessions where the agent hallucinated."
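As a minimal sketch, the schema above can be written as a Python dataclass. The field names come from the list above; the example values ("planner", "model-v1", the label "hallucination") and the helper sessions_with_failure are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceStep:
    """One step of an agent run, using the field names from the schema above."""
    session_id: str
    run_id: str
    step_type: str                       # e.g. "planner", "tool", "retrieval" (illustrative)
    input: str
    output: str
    latency_ms: float
    model_version: str
    failure_mode: Optional[str] = None   # e.g. "hallucination"; None when unlabeled


def sessions_with_failure(steps: list[TraceStep], failure_mode: str) -> set[str]:
    """Return the IDs of sessions with at least one step carrying this label."""
    return {s.session_id for s in steps if s.failure_mode == failure_mode}


steps = [
    TraceStep("s1", "r1", "retrieval", "invoice.pdf", "3 chunks", 120.0, "model-v1"),
    TraceStep("s1", "r1", "planner", "compute quote", "$4,200", 850.0, "model-v1",
              failure_mode="hallucination"),
    TraceStep("s2", "r2", "planner", "compute quote", "$1,150", 790.0, "model-v1"),
]
print(sessions_with_failure(steps, "hallucination"))   # {'s1'}
```

Without the failure_mode field, the same query has nothing to filter on, which is the point of keeping it in the schema from day one.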

Silent Failures Are a Taxonomy, Not a Catch-All

Naming silent failures as a category is useful. Knowing what their traces actually look like is what lets you catch them.

Hallucinations show up when an agent's output contradicts retrieved context or known facts. The dangerous pattern isn't the obvious hallucination; it's the plausible one. We worked with a Series B finance startup using agents to automate accounts receivable. Their agent took in vendor PDFs, extracted data, and computed quotes. In week two of launch, it looked fine: different quotes for different PDFs, approximately the right prices. But it wasn't ingesting the PDFs properly. It was hallucinating the quote based on the RFP and surrounding customer data, not the actual document. From every operational metric, the agent was succeeding end-to-end. It would not have been caught, as the team put it, "for a century" without explicit step-level monitoring.

User frustration is observable in session patterns: the user restates the same intent two or more times, the session ends without task completion, or the user's messages get longer (a signal of escalating rephrasing). None of these patterns are visible in request-level logs.
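The three session patterns can be approximated with simple rules. The sketch below is a rough heuristic, not a classifier: the Session type, the 0.7 similarity threshold, and the use of character-level similarity as a stand-in for "same intent" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Session:
    user_messages: list[str]
    task_completed: bool


def frustration_signals(session: Session, similarity: float = 0.7) -> dict[str, bool]:
    """Flag the three session-level patterns using crude text heuristics."""
    msgs = [m.lower().strip() for m in session.user_messages]
    # Count consecutive user messages that are near-duplicates of each other.
    restatements = sum(
        SequenceMatcher(None, a, b).ratio() >= similarity
        for a, b in zip(msgs, msgs[1:])
    )
    lengths = [len(m) for m in msgs]
    # Strictly increasing length over at least three messages.
    growing = len(lengths) >= 3 and all(x < y for x, y in zip(lengths, lengths[1:]))
    return {
        "restated_intent": restatements >= 2,
        "abandoned": not session.task_completed,
        "growing_messages": growing,
    }


frustrated = Session(
    user_messages=[
        "How do I reset my password?",
        "How do I reset my password please?",
        "How do I reset my password please? It is urgent",
    ],
    task_completed=False,
)
signals = frustration_signals(frustrated)
```

Every input here is a whole session, which is why none of these signals can be computed from a single request log line.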

Forgetfulness appears in multi-turn traces as the agent referencing prior context incorrectly or ignoring information already established in the session.

Two failure modes that get less attention but matter just as much:

Prompt injection. An agent follows malicious instructions embedded in retrieved content and completes the run successfully by every operational metric, but does the wrong thing. This is an observability problem: if you can't see how instructions entered the system and where the model's behavior changed, you're debugging shadows. Prompt injection attempts leave fingerprints in traces. A system that knows what to look for can flag suspicious patterns before they become incidents.

Cascade degradation. A single weak retrieval step poisons downstream reasoning. No individual span shows an error. The agent finishes. The answer is wrong. This is only visible at the session level when you can see how a bad intermediate output propagated forward.

This is why silent failures require classification, not thresholding. You cannot write a latency or error-rate rule that catches "the answer was plausible but wrong." The semantic layer of LLM observability is architecturally different from APM, not just additively different.

The Misconception That Keeps Teams Flying Blind

The non-obvious one: most teams believe sampling is sufficient for quality monitoring because it works well for infrastructure metrics. For agent quality, sampling doesn't just reduce coverage, it systematically excludes the failures that matter most.

Rare but high-impact silent failures, say a hallucination pattern affecting 0.3% of sessions but concentrated among your highest-value enterprise users, can easily fail to appear in a 5% or 10% sample at all. When they do appear, it is as a handful of cases, too few to read as a pattern. You won't miss them by bad luck. You'll miss them because the sampling design makes that the expected outcome.
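To put rough numbers on this, the sketch below assumes uniform random sampling where each session is kept independently with probability sample_rate. The session volumes are hypothetical; the 0.3% failure rate and the 5% and 10% sample rates are the ones from the paragraph above.

```python
def sample_visibility(total_sessions: int, failure_rate: float, sample_rate: float):
    """Return (affected sessions, expected affected sessions in the sample,
    probability that the sample contains no affected session)."""
    affected = round(total_sessions * failure_rate)
    expected_in_sample = affected * sample_rate
    p_none = (1 - sample_rate) ** affected
    return affected, expected_in_sample, p_none


for total in (1_000, 10_000):
    for rate in (0.05, 0.10):
        affected, expected, p_none = sample_visibility(total, 0.003, rate)
        print(f"{total:>6} sessions, {rate:.0%} sample: "
              f"{affected} affected, {expected:.2f} expected in sample, "
              f"P(none sampled) = {p_none:.2f}")
```

Under these assumptions, a 5% sample of 1,000 sessions misses all three affected sessions about 86% of the time. At higher volumes a few affected sessions do land in the sample, but as one or two cases among hundreds of healthy ones, which is easy to dismiss as noise.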

In a traditional web app, p99 latency is a reasonable proxy for user experience. In an AI agent, a deployment can have excellent p99 latency and still leave users worse off than before they started. Sampling a metric that doesn't capture quality means you're sampling the wrong thing at high fidelity.

The second misconception worth naming: LLM observability is just "adding evals." Evals are a subset: offline or CI-gated quality checks that test what your agent can do. Observability is the operational equivalent: continuous classification of live production traffic, not a test suite you run before deployment. As one framing puts it, evaluation determines if the AI agent can work; observability determines if it is working. The distinction matters because most issues pop up in production in ways you couldn't predict or account for beforehand.

This is the design principle behind Sentrial. Rather than sampling logs through a generic LLM-as-judge approach, Sentrial runs post-trained classifiers fine-tuned on each customer's own agent traffic against every log, not a sample. That full-coverage classification is one part of a complete observability stack that also includes session-level tracing, automated evaluations, prompt A/B testing with statistical rigor, real-time Slack alerts, and source-code-level failure pinpointing with fix suggestions. The accuracy difference on classification matters because LLM-as-judge systems are themselves subject to hallucination. With agents running hundreds of tool calls over sessions lasting hours, a generic judge can't reach the same classification accuracy as a model trained on the specific patterns of that customer's traffic.

What a Real Debugging Workflow Looks Like

Consider a concrete scenario that plays out in 2026 across teams running production customer support agents: session abandonment rises 12% week-over-week, and nobody knows why.

Without LLM observability. Engineering escalation, manual log sampling, no session-level grouping. The team debates whether the problem is retrieval, the model, the prompt, or a tool. Because each request looks healthy in isolation, there's no way to see the pattern. Resolution time: days to weeks.

With it. A frustration signal spikes on sessions involving a specific tool call. Filter to those sessions. The traces show retrieval returning stale context from a cache that wasn't invalidating correctly. A Fortune 1000 customer we work with, running custom Python and LangChain agents across supply chain, HR, and marketing workflows, went from a 20% error rate to under 10% in a single week once they had this kind of session-level visibility. The fix wasn't a model change. It was a retrieval bug that was invisible at the request level.

The pattern here matters more than the specific tool. Full request tracing makes it possible to move from a detected issue to a clear explanation: engineers can inspect a specific trace and see exactly what the model received, what it returned, and how long each step took. Sentrial's ability to replay and fork from any intermediate step in an agent run, so you can branch from the retrieval step and test a fix without re-running the full agent, turns what used to be a multi-day investigation into an hours-long one.

How to Get Started with LLM Observability

Start with instrumentation, not evaluation. Most teams do it backwards.

Step 1. Instrument your agent boundaries. Every session start and end, every planner call, executor step, tool invocation, and retrieval operation should be a span. Use OpenTelemetry or a framework-native library (OpenInference works well for Python agents; LangChain and LangGraph both have native tracing hooks).

Step 2. Enrich every span with session_id, run_id, model_version, and step_type. These four fields are the minimum for session-level analysis. Without them, you have logs, not observability.
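Steps 1 and 2 can be sketched with the standard library alone. The span helper below is a hypothetical stand-in written for illustration, not the OpenTelemetry or OpenInference API; in a real setup the tracing library would supply the span context manager and the exporter. It shows the two things that matter: parent/child links between spans, and the four enrichment fields on every one.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

_current_span = ContextVar("current_span", default=None)
RECORDED = []   # stand-in for an exporter or trace backend


@contextmanager
def span(step_type, *, session_id, run_id, model_version):
    """Record one step, linked to whichever span is currently open."""
    parent = _current_span.get()
    record = {
        "span_id": uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        "session_id": session_id,
        "run_id": run_id,
        "model_version": model_version,
        "step_type": step_type,
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        RECORDED.append(record)


ctx = dict(session_id="s1", run_id="r1", model_version="model-v1")
with span("planner", **ctx) as planner:
    with span("retrieval", **ctx) as retrieval:
        pass   # call the retriever here
    with span("tool", **ctx) as tool:
        pass   # call the tool here
```

Both child spans carry the planner's span_id as their parent_id, so the run can be reassembled as a tree, and every record can be grouped by session_id afterwards.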

Step 3. Route traces to a backend that can query at session level, not just request level. This matters the moment you need to ask "which sessions included this specific retrieval step and ended in abandonment?"
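As a sketch of what "query at session level" means, the example below answers that question over an in-memory list of spans. The name and outcome fields, the session_end step type, and the sample data are assumptions for illustration; a real backend would run the equivalent query over stored traces.

```python
from collections import defaultdict

spans = [
    {"session_id": "s1", "step_type": "retrieval", "name": "kb_search"},
    {"session_id": "s1", "step_type": "session_end", "outcome": "abandoned"},
    {"session_id": "s2", "step_type": "retrieval", "name": "kb_search"},
    {"session_id": "s2", "step_type": "session_end", "outcome": "completed"},
    {"session_id": "s3", "step_type": "tool", "name": "create_ticket"},
    {"session_id": "s3", "step_type": "session_end", "outcome": "abandoned"},
]


def sessions_matching(spans, step_name, outcome):
    """Sessions that included a step with this name AND ended with this outcome."""
    by_session = defaultdict(list)
    for s in spans:
        by_session[s["session_id"]].append(s)
    return sorted(
        sid for sid, group in by_session.items()
        if any(s.get("name") == step_name for s in group)
        and any(s.get("outcome") == outcome for s in group)
    )


print(sessions_matching(spans, "kb_search", "abandoned"))   # ['s1']
```

The filter combines facts from two different spans in the same session, which is exactly what a request-level store cannot express.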

Step 4. Plan the quality layer early, even if you implement it later. Your options:

Approach                                 Accuracy   Cost          Coverage
Rule-based heuristics                    Low        Minimal       Full
LLM-as-judge (sampled)                   Medium     Medium        Partial
LLM-as-judge (full coverage)             Medium     High          Full
Fine-tuned classifiers (per-customer)    High       Low per log   Full

Heuristics miss nuance. LLM-as-judge at full coverage gets expensive fast and is itself subject to the hallucination problem. Fine-tuned classifiers trained on your own traffic are more accurate but require a platform built around that approach.

Decision heuristic. If your agent handles fewer than 1,000 sessions per month, start with OpenTelemetry instrumentation and manual review. The sampling gap isn't yet your biggest problem. If you're at scale with real users and business risk attached to agent output, the gap between what sampling shows and what's actually happening becomes your primary observability liability.

At that scale, Sentrial is worth looking at. It is a production monitoring platform for AI agents that covers the full observability stack: session-level tracing with inputs, outputs, latency, and token costs at every step; automated evaluations with built-in classifiers for hallucinations, bad tool calls, agent forgetfulness, and jailbreaking, plus custom classifiers you can instantiate from a few sample logs in under a minute; prompt A/B testing with statistical rigor; real-time Slack alerts on error spikes and behavioral anomalies; and source-code-level failure pinpointing with fix suggestions. Setup is five lines of instrumentation on top of OpenTelemetry, LangChain, LangGraph, or custom Python agents. Coverage is every log, not a slice of them.

FAQ

What is meant by LLM observability?

LLM observability is continuous, end-to-end visibility into an AI model or agent's behavior across every interaction. It covers the inputs and outputs at each step, execution traces across model calls, tool invocations, and retrieval operations, operational metrics like latency and token usage, and quality signals like hallucinations and user frustration. The key distinction from traditional monitoring is that it measures whether the agent is doing the right thing, not just whether it's running.

How is LLM observability different from traditional application monitoring?

Traditional APM tools detect infrastructure failures: crashes, timeouts, latency spikes, and error codes. They instrument the execution layer. LLM observability adds a semantic layer that traditional APM has no concept of: was the response accurate? Did the user get what they needed? A 200 OK with a confidently wrong answer is invisible to Datadog. Across 12 million production agent logs we've analyzed at Sentrial, 78% of failures were silent, generating no error signal that any APM tool would catch.

What are the most important metrics to track for LLM observability?

Track across four layers: (1) I/O: prompts, completions, session IDs, model version; (2) execution traces: spans for every planner, tool, and retrieval step with parent/child relationships; (3) operational metrics: latency per span, token usage, cost per run; (4) quality signals: hallucination rate, task completion rate, user frustration indicators, and any domain-specific failure modes relevant to your agent. Most teams implement layers one through three and treat layer four as optional. It shouldn't be. Layer four is where 78% of real failures live.

What are the five pillars of LLM observability?

The commonly cited pillars are: traces (end-to-end execution visibility across every agent step), metrics (latency, token usage, cost, throughput), logs (raw input/output records), evaluations (quality scoring against ground truth or rubrics), and feedback signals (user ratings, session outcomes, task completion). In practice, traces and evaluations are the two that differentiate LLM observability from traditional APM. The others exist in some form in standard monitoring stacks; the quality and behavioral layers do not.

Do I need LLM observability if I'm already using Datadog or another APM tool?

Yes, if your agents handle real users and real business decisions. As Confident AI notes, traditional APM tools are adding LLM tabs that track tokens and latency, but none of them answer the question that actually matters: is your AI producing good outputs? Datadog catches latency spikes and 500 errors. It will not alert you when your agent starts hallucinating on a specific input class, or when users start abandoning sessions at a specific step. Those are the failures that matter most, and they require a different layer of instrumentation than your existing APM provides. Sentrial is built specifically to cover that gap, combining tracing, automated evaluations, A/B testing, real-time alerting, and code-level debugging in a single platform that integrates directly on top of your existing OpenTelemetry setup.

Try Sentrial

Catch behavioral AI agent failures traditional APM tools miss.
