Best Practices for Agentic AI Observability in Production

Agentic AI observability isn't just the telemetry you capture; it's the full loop from trace to alert to fix. Here's how it actually works.


Neel Sharma

May 31, 2026 · 12 min read

Agentic AI observability is end-to-end visibility into what a multi-step AI agent did, whether it did it correctly, and exactly where it went wrong. It covers session-level traces, automated quality evaluations, real-time alerts, and code-level debugging in one connected loop. Unlike traditional monitoring, which asks "did the service stay up?", agentic observability asks "did the agent reach the right goal, through the right steps, at an acceptable cost?" Those are fundamentally different questions, and most teams are still trying to answer the second one with tools built for the first.

For deeper background on the broader observability stack, see our LLM observability explainer.

Why Legacy Monitoring Can't See What Agents Are Doing

Traditional APM catches infrastructure failures: CPU spikes, memory leaks, elevated error rates, slow response times. An agent that returns a hallucinated answer with a 200 OK response is invisible to those tools. The service looks healthy because it is healthy; the agent just gave the wrong answer.

There are three failure categories that evade legacy monitoring entirely. Wrong answers: hallucinations, factual drift, confident-sounding responses that are simply incorrect. Wrong process: bad tool calls, goal abandonment, an agent that forgets context from step three by step nine in a long session. Wrong cost: token waste, runaway retry loops that never surface as exceptions but quietly burn budget. All three can coexist in a run that Datadog marks green.

From our analysis of 12 million production logs, around 78% of agent failures were not clean errors or timeouts. The majority were silent: the user got a wrong or useless answer and left. Hallucinations were the top failure category, followed by user frustration, followed by agent forgetfulness or laziness. Only about 22% of issues were explicit tool call failures that actually stopped a run. The rest slipped through every infrastructure alert, every error rate threshold, every latency dashboard.

A concrete example: one Series B finance startup launched a vendor quoting agent that ingested PDFs, extracted line items, and computed quotes. It looked fine in production for weeks. Different PDFs produced different quotes; prices seemed approximately right. What no one noticed was that the PDF ingestion was broken. The agent was hallucinating quotes based on RFP context and customer metadata, completely bypassing the actual document. No exceptions. No error codes. No latency anomalies. Just consistently wrong numbers going out to vendors.

As Arize notes in their field analysis, your existing observability stack reports "Success" because the HTTP status code was 200. And as InsightFinder describes, retrieval failures are often silent: latency stays acceptable, error rates stay low, and the system appears healthy from the outside.

For a deeper look at the silent failure taxonomy, see Your Agent Isn't Crashing. It's Lying.

The Four Signal Types Agentic Observability Requires

Agentic observability requires four distinct signal types: session-level traces, interaction logs with tool call payloads, cost and latency metrics per step, and quality evaluation metrics like hallucination rate and goal completion rate. Capturing only the first three leaves you with a complete picture of what the agent did and a blank on whether any of it was correct.

The MELT framework (Metrics, Events, Logs, Traces) maps to agent-specific artifacts differently than it does for traditional services:

  • Traces/Spans cover the full session journey from first user message to final output, with each decision step as a child span. Every intermediate state transition, every branch the agent took, every tool it invoked.
  • Logs/Events capture LLM calls and tool invocations including inputs, outputs, latency, and failure codes. Not just "tool_call: search" but what the agent sent to the tool and what came back.
  • Metrics track token usage, cost per step, latency per step, and retry counts. These are the signals teams miss when they instrument with generic APM alone.
  • Quality/Evaluation metrics are where most teams fall short. Hallucination rate, tool failure rate, goal completion rate, and safety flags all require running automated evaluations on top of the raw traces, not just collecting them.

The fields teams most often miss when instrumenting with generic OpenTelemetry: token usage and cost at the span level, intermediate state transitions between steps, tool call result payloads (not just whether the call succeeded), and eval assertion results tied to a specific session.

One critical requirement that rarely gets mentioned: every signal across a multi-step run must share a single session or execution ID. In multi-agent workflows, where one orchestrator hands off to specialist agents, this correlation anchor is the difference between debugging a failure in minutes and spending an hour manually reconstructing which agent touched which data at which step.
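As a rough illustration of the fields and the correlation anchor described above, here is a minimal Python sketch of a per-step record. The `StepRecord` class, its field names, and the `group_by_session` helper are hypothetical, chosen for this example; they are not an OpenTelemetry schema or any vendor's format.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StepRecord:
    """One step of an agent run. All field names here are illustrative."""
    session_id: str                  # shared by every signal in the run
    agent: str                       # which agent produced this step
    step: int                        # position within the session
    tool_name: Optional[str] = None
    tool_input: Any = None           # what the agent sent to the tool
    tool_output: Any = None          # what came back, not just success/failure
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    eval_results: dict = field(default_factory=dict)  # e.g. {"hallucination": False}


def group_by_session(records):
    """Rebuild each run's ordered timeline from a flat stream of records."""
    sessions = defaultdict(list)
    for record in records:
        sessions[record.session_id].append(record)
    for steps in sessions.values():
        steps.sort(key=lambda r: r.step)
    return dict(sessions)


# Records arrive out of order and interleaved across sessions and agents.
records = [
    StepRecord("sess-1", "specialist", 2, tool_name="search",
               tool_input={"q": "invoice 42"}, tool_output=None,
               input_tokens=310, output_tokens=45, cost_usd=0.004),
    StepRecord("sess-2", "orchestrator", 1, input_tokens=90, output_tokens=12, cost_usd=0.001),
    StepRecord("sess-1", "orchestrator", 1, input_tokens=120, output_tokens=30, cost_usd=0.002),
]
timeline = group_by_session(records)
print([(r.step, r.agent) for r in timeline["sess-1"]])
```

Because every record carries the same `session_id`, the orchestrator-to-specialist handoff in `sess-1` can be reassembled with one grouping pass instead of manual reconstruction.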

OpenTelemetry's semantic conventions for Generative AI focus on traces, metrics, and events together as a framework. The OpenTelemetry GitHub working group is actively standardizing attributes for tracing tasks, actions, agents, teams, artifacts, and memory across complex AI workflows. LangChain and LangGraph already emit OTel-compatible traces; at Sentrial, we ingest via OTel alongside our native framework integrations, then run classification and clustering on top.

The important distinction: OTel gives you the wire format and the spans. It doesn't tell you whether the output in those spans was any good. That's the job of the evaluation layer.

Why Sampling Breaks Agentic Observability

Sampling strategies that work fine for infrastructure monitoring are dangerous for agents. Rare-but-critical failure modes (goal abandonment on a specific intent pattern, jailbreak attempts, tool misuse on edge-case inputs) occur in the tail of the distribution and get dropped by any reasonable sample rate.

The math makes this concrete. If you sample 10% of production runs and hallucinations occur in 2% of sessions, the hallucinated sessions you actually capture amount to roughly 0.2% of traffic. That's too thin for reliable regression detection or rate alerting. A prompt change could silently lift your hallucination rate from 2% to 6%, and you might run for weeks before enough sampled sessions accumulate to detect it. OpenObserve's analysis of sampling strategies makes this explicit: if errors represent 0.1% of traffic and your sampling rate is 5%, the error traces you keep amount to only 0.005% of total traffic. For a high-volume system, that's roughly one error trace every few minutes. Logz.io's tracing guide puts it plainly: rare events that matter for incident flagging may simply not be sampled.
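The sampling arithmetic above can be checked with a few lines of Python. The daily volume and the "narrow intent" scenario below are assumed numbers for illustration only, not figures from our logs.

```python
def sampled_failures(sessions: float, failure_rate: float, sample_rate: float) -> float:
    """Expected number of failing sessions that survive sampling."""
    return sessions * failure_rate * sample_rate


daily_sessions = 50_000  # hypothetical volume

# 2% hallucination rate at 10% sampling: share of all traffic that is a sampled failure.
share_of_traffic = 0.02 * 0.10
at_10_percent = sampled_failures(daily_sessions, 0.02, 0.10)
at_full_coverage = sampled_failures(daily_sessions, 0.02, 1.0)

# Hypothetical narrow intent: 1% of traffic, failing 6% of the time after a prompt change.
narrow_sessions = daily_sessions * 0.01
narrow_sampled = sampled_failures(narrow_sessions, 0.06, 0.10)
narrow_full = sampled_failures(narrow_sessions, 0.06, 1.0)

print(f"sampled failures as share of traffic: {share_of_traffic:.2%}")
print(f"hallucinated sessions per day: {at_10_percent:.0f} sampled vs {at_full_coverage:.0f} full")
print(f"narrow-intent failures per day: {narrow_sampled:.0f} sampled vs {narrow_full:.0f} full")
```

Under these assumptions, sampling leaves about 3 failing sessions per day on the narrow intent, against 30 with full coverage, which is the gap between waiting on volume and seeing the regression the same day.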

For agent observability, this isn't a theoretical concern. The failures teams most urgently need to detect (the specific intent pattern that triggers goal abandonment, the input format that breaks tool parsing, the session length that causes context collapse) are exactly the kind of low-frequency events that sampling drops.

Full-log coverage changes the operational calculus entirely. When you classify every interaction rather than a sample, behavioral regressions surface immediately. A prompt change that degrades performance on a narrow intent type shows up within the first few hundred affected sessions, not after enough volume survives the sample filter to hit significance. That's not a "more data" preference; it's an operational reliability choice. The question isn't whether you want more coverage. It's whether you want to know about a regression before or after your users do.

Closing the Loop: From Trace to Fix

Collecting telemetry is the easy part. The operational loop that turns collected signals into detected failures, fired alerts, and closed remediations is where most teams stall. The four stages have to connect; each one depends on the previous.

Stage 1: Capture. Instrument with OpenTelemetry or a framework-native integration (LangChain, LangGraph). Capture full session traces with consistent execution IDs, token costs at every step, and tool call payloads. This is table stakes, but it has to be done right: missing session IDs or partial tool payloads make every later stage harder.

Stage 2: Evaluate. Run automated evaluations on every interaction, not a sample. Built-in classifiers cover common failure modes: hallucinations, bad tool calls, agent forgetfulness. Custom classifiers cover the domain-specific failures that matter for your agent. One of our fintech customers instantiated a "mismatched GL codes" classifier on their production logs because their failure mode wasn't something any pre-built eval would catch. As they described it: even end states had a hundred different variations, and an agent that can take many different paths based on input can fail in ways that a start/end state check never surfaces.
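To make "every interaction, not a sample" concrete, here is a small sketch of an evaluation pass that runs a set of classifiers over every logged interaction and reports per-failure-mode rates. The classifiers are deliberately simple rule-based stand-ins, and the log fields (`tool_calls`, `expected_gl_code`, `agent_gl_code`) are hypothetical; a production classifier would typically be a trained model rather than a rule.

```python
from typing import Callable, Dict, List

# A classifier takes one logged interaction and returns True when it detects a failure.
Classifier = Callable[[dict], bool]


def tool_returned_nothing(interaction: dict) -> bool:
    """Flags interactions where any tool call came back empty."""
    return any(call.get("output") in (None, "", [], {})
               for call in interaction.get("tool_calls", []))


def mismatched_gl_code(interaction: dict) -> bool:
    """Hypothetical domain rule: the agent's GL code differs from the expected one."""
    expected = interaction.get("expected_gl_code")
    return expected is not None and interaction.get("agent_gl_code") != expected


CLASSIFIERS: Dict[str, Classifier] = {
    "tool_returned_nothing": tool_returned_nothing,
    "mismatched_gl_code": mismatched_gl_code,
}


def evaluate_all(interactions: List[dict]) -> Dict[str, float]:
    """Run every classifier on every interaction; return the failure rate per mode."""
    counts = {name: 0 for name in CLASSIFIERS}
    for interaction in interactions:
        for name, classify in CLASSIFIERS.items():
            if classify(interaction):
                counts[name] += 1
    total = len(interactions) or 1  # avoid division by zero on an empty batch
    return {name: count / total for name, count in counts.items()}


logs = [
    {"tool_calls": [{"name": "lookup", "output": {"gl": "6100"}}],
     "expected_gl_code": "6100", "agent_gl_code": "6100"},
    {"tool_calls": [{"name": "lookup", "output": None}],
     "expected_gl_code": "6100", "agent_gl_code": "7200"},
    {"tool_calls": [], "expected_gl_code": "5000", "agent_gl_code": "5000"},
    {"tool_calls": [{"name": "lookup", "output": {"gl": "4000"}}]},
]
rates = evaluate_all(logs)
print(rates)
```

Adding a new failure mode in this shape means registering one more function in `CLASSIFIERS`; the evaluation loop and the resulting rate metrics stay unchanged.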

Custom classifier deployment at Sentrial works like this: a team identifies a new failure mode, labels three or four example logs, and a fine-tuned classifier is running in production within a minute. The approach uses post-training models rather than LLM-as-judge, which matters at scale. Passing millions of logs through an LLM judge with a system prompt produces accuracy that's honestly worse than running a few unit tests. You need something that can keep up with production volume without degrading in precision.

Stage 3: Alert. Fire real-time Slack alerts on error spikes or behavioral anomalies. The alert payload matters as much as the trigger: failure classification, contributing step, agent version, and a direct link to the failing session. An alert that says "hallucination rate elevated" is a starting point; an alert that says "hallucination rate up 4.2% since last deploy, concentrated in sessions where tool X returned null" is actionable.
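A sketch of what an actionable alert payload might look like follows. The `Alert` fields mirror the list above; the `min_increase` threshold, the example URL, and the message wording are assumptions for illustration. The `{"text": ...}` body is the basic shape Slack incoming webhooks accept, and the sketch stops short of actually posting it.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    failure_class: str       # e.g. "hallucination"
    current_rate: float      # rate in the current window
    baseline_rate: float     # rate before the last deploy
    contributing_step: str   # where failures concentrate
    agent_version: str
    session_url: str         # direct link to a failing session


def build_alert(failure_class: str, current_rate: float, baseline_rate: float,
                contributing_step: str, agent_version: str, session_url: str,
                min_increase: float = 0.02) -> Optional[Alert]:
    """Return an alert only if the rate rose by at least min_increase (absolute)."""
    if current_rate - baseline_rate < min_increase:
        return None
    return Alert(failure_class, current_rate, baseline_rate,
                 contributing_step, agent_version, session_url)


def to_slack_text(alert: Alert) -> str:
    """Render the alert as one line that says what changed and where to look."""
    delta_points = (alert.current_rate - alert.baseline_rate) * 100
    return (f"{alert.failure_class} rate up {delta_points:.1f} points "
            f"(now {alert.current_rate:.1%}) on {alert.agent_version}; "
            f"concentrated at step '{alert.contributing_step}'. "
            f"Session: {alert.session_url}")


alert = build_alert("hallucination", 0.062, 0.020, "tool X returned null",
                    "v1.4.0", "https://example.com/sessions/abc123")
payload = json.dumps({"text": to_slack_text(alert)}) if alert else None
print(payload)
```

Keeping the threshold check and the rendering separate makes it easy to route the same `Alert` to another channel later without touching the trigger logic.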

Stage 4: Debug. Replay or fork from any intermediate step in a failing run. The goal is source-code-level failure pinpointing with fix suggestions, rather than scrolling through raw log output trying to reconstruct state. Braintrust's analysis of debugging tools describes this well: once the failing step is identified, the trace can be loaded and re-run against the exact production inputs, tool calls, intermediate steps, and model configuration that triggered the failure. Reconstruction from logs is what you do when you don't have replay; it's not a workflow.
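The replay-and-fork idea can be sketched in a few lines, assuming each recorded step stores the state going in and the tool output observed in production. The trace layout, `replay_from`, and `fixed_step` are hypothetical names for this example, not the API of any replay tool.

```python
from typing import Any, Callable, Dict, List, Optional

# Each recorded step keeps the incoming state and the tool output seen in production.
RecordedStep = Dict[str, Any]


def replay_from(trace: List[RecordedStep], start: int,
                step_fn: Callable[[dict, Any], dict],
                override_state: Optional[dict] = None) -> List[dict]:
    """Re-run a trace from step `start`, feeding recorded tool outputs back in.

    step_fn(state, tool_output) is the per-step agent logic under test.
    Tools are not called again, so the replay is deterministic.
    Passing override_state forks the run from a modified starting state.
    """
    state = dict(override_state if override_state is not None
                 else trace[start]["state_in"])
    states = []
    for recorded in trace[start:]:
        state = step_fn(state, recorded["tool_output"])
        states.append(state)
    return states


def fixed_step(state: dict, tool_output: Any) -> dict:
    """Candidate fix: record an error on empty tool output instead of ignoring it."""
    new_state = {"total": state["total"], "errors": list(state["errors"])}
    if tool_output is None:
        new_state["errors"].append("empty tool output")
    else:
        new_state["total"] += tool_output["price"]
    return new_state


trace = [
    {"state_in": {"total": 0.0, "errors": []}, "tool_output": {"price": 40.0}},
    {"state_in": {"total": 40.0, "errors": []}, "tool_output": None},  # the failing step
    {"state_in": {"total": 40.0, "errors": []}, "tool_output": {"price": 2.5}},
]

replayed = replay_from(trace, 1, fixed_step)  # replay from the failing step
forked = replay_from(trace, 2, fixed_step,
                     override_state={"total": 100.0, "errors": []})  # fork with new state
print(replayed[-1], forked[-1])
```

Replaying from step 1 with the candidate fix surfaces the empty tool output as an explicit error, which is the check a start/end comparison would have missed.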

The four stages only work as a loop. Evaluations without traces have no context for why a failure happened. Alerts without evaluations are latency dashboards. Debugging without replay means manually reconstructing agent state from logs. Most teams end up stitching together a tracer, a separate eval harness, a custom alerting webhook, and whatever replay tooling they can build themselves. Sentrial covers all four stages in one platform, which means the data flowing from trace to eval to alert to debug is consistent and correlated by default.

The Misconception That Trips Up Most Teams

The most common mistake we see: teams treat agentic observability as an evaluation problem. They instrument evals in staging, the benchmark passes, and they ship to production assuming the eval signal carries over. It doesn't.

Evals on a fixed test set measure a snapshot. Production observability measures a stream. A model that scores 94% on your eval suite can simultaneously be hallucinating on 8% of production sessions that your test set doesn't represent, and you won't know until a user complains or you're actively monitoring production. Static benchmark scores don't predict production reliability under distribution shift, as LayerLens describes: production-grade evaluation requires pre-deployment gating, shadow deployment, continuous monitoring, drift detection, and governance. Passing a benchmark is step one of a multi-step process, not the finish line.

The finance startup from earlier is the clearest example. The PDF ingestion failure went undetected because the agent produced outputs that looked fine. No evals were checking every step along the way, and no one can manually review millions of logs. The agent succeeded end-to-end from a surface perspective. That failure, in their words, "would not have been caught for a century" without automated production monitoring.

The second layer: even teams that do run production evals often run them on samples, not full coverage, which compounds the gap. Braintrust's LLM monitoring article makes the point clearly: online evaluation catches issues that offline testing cannot anticipate, and agent workflows multiply monitoring complexity because a single user request can trigger chains of tool calls.

The insight worth internalizing: passing evals and having observability are not the same thing. Evals are a pre-flight check. Observability is the flight recorder. They serve different purposes, and neither replaces the other. For the deeper argument on what evals miss, see Your Agent Passing Evals Means Nothing.

How to Get Started with Agentic AI Observability

The starting path is ordered deliberately. Skipping steps creates gaps that compound: you can't alert on failure rates you haven't evaluated; you can't evaluate failures you haven't traced.

Step 1: Instrument. Start with OpenTelemetry or a framework-native integration (LangChain, LangGraph). Capture session traces with consistent execution IDs, token counts and costs at every step, and tool call inputs and outputs. If you're using LangGraph, the OTel-compatible traces are already emitting most of what you need. The integration should take under an hour.

Step 2: Define failure modes before you need them. What counts as a hallucination for your agent? What does goal abandonment look like in your session flow? What tool failure rate is acceptable before you alert? These definitions are harder to write than the instrumentation, and teams that skip this step end up with data and no signal.

Step 3: Run evaluations on production data. Not just the staging eval set. Production data distribution shifts from the moment you launch; eval coverage needs to follow it. If you're not evaluating production sessions, you have traces without quality signal.

Step 4: Alert on eval-derived metrics. Set thresholds on hallucination rate, tool failure rate, and goal completion rate, not just latency and error count. A latency alert fires when the service is slow. An eval-derived alert fires when the agent is wrong.
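A minimal sketch of eval-derived thresholds is shown below. The metric names and limits are placeholders to be replaced with the failure-mode definitions from Step 2, and `breached` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_worse: bool = True  # False for metrics like goal completion rate


# Placeholder limits; set these from your own failure-mode definitions.
THRESHOLDS = [
    Threshold("hallucination_rate", 0.03),
    Threshold("tool_failure_rate", 0.05),
    Threshold("goal_completion_rate", 0.90, higher_is_worse=False),
]


def breached(metrics: Dict[str, float],
             thresholds: Sequence[Threshold] = THRESHOLDS) -> List[str]:
    """Return the names of metrics that crossed their limit in this window."""
    crossed = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is None:
            continue  # metric was not evaluated in this window
        if threshold.higher_is_worse:
            is_breach = value > threshold.limit
        else:
            is_breach = value < threshold.limit
        if is_breach:
            crossed.append(threshold.metric)
    return crossed


window = {
    "hallucination_rate": 0.06,
    "tool_failure_rate": 0.01,
    "goal_completion_rate": 0.84,
    "p95_latency_ms": 900,  # healthy latency; ignored because no threshold is defined
}
print(breached(window))
```

In this window latency is fine, yet two quality thresholds are crossed, which is the case a latency-only alert never sees.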

For teams who want to build this stack themselves, the open-source components exist for individual stages: OTel for instrumentation, OpenInference or OpenLLMetry semantic conventions for GenAI spans, DeepEval or Arize Phoenix for evals. The integration work connecting them into a coherent alerting and debugging loop is real; plan for it.

At Sentrial, the integration is five lines of code against whatever agent framework you're using. Built-in classifiers for hallucinations, bad tool calls, agent forgetfulness, and jailbreaking are running immediately. Custom classifiers deploy from three to four labeled examples in under a minute. The full loop (trace, evaluate, alert, debug with replay) is one platform rather than four tools glued together.

Once baseline observability is running, the next step worth pursuing is prompt A/B testing with statistical rigor on live traffic, measuring impact on the failure modes you've defined. See our piece on prompt A/B testing in production for how to structure that without introducing confounds.

Looking ahead: agentic systems in 2026 are already running for hours, invoking dozens of tools, and operating as multi-agent orchestrations where one agent's output is another agent's input. Research on agent drift shows that LLM-based agents introduce behavioral drift where decision-making patterns progressively deviate from design specifications without explicit parameter changes or system failures. The cost of retrofitting observability after a production incident is always higher than instrumenting from the start. And Dynatrace's 2026 agentic AI report found that 44% of organizations still rely on manual methods to monitor agent interactions. Manual review doesn't scale to millions of sessions, and it doesn't catch failures you don't already know to look for.

FAQ

What is agentic AI observability?

Agentic AI observability is end-to-end visibility into what a multi-step AI agent did, whether it did it correctly, and exactly where it went wrong. It covers session-level tracing (every decision step and tool call), automated quality evaluations (hallucination detection, tool failure classification, goal completion tracking), real-time alerting on behavioral anomalies, and source-code-level debugging. The key distinction from traditional observability: it measures outcome quality, not just service health.

What telemetry data does agentic AI observability collect?

The four signal types are: session traces with per-step spans and consistent execution IDs; interaction logs including LLM call inputs/outputs and tool call payloads; metrics covering token usage, cost per step, and latency; and quality evaluation results (hallucination rate, tool failure rate, goal completion rate, safety flags). Token costs and eval assertion results at the span level are the fields most commonly missing from teams using generic APM instrumentation.

How is agentic AI observability different from traditional observability?

Traditional observability monitors infrastructure: CPU, memory, request latency, HTTP error rates. It tells you whether the service stayed up and responded. Agentic observability monitors behavior: whether the agent reached the right goal, through the right steps, at an acceptable cost. An agent that returns a hallucinated answer with a 200 OK response is invisible to traditional APM. From our analysis of 12 million logs, around 78% of agent failures are silent from an infrastructure perspective: no crash, no timeout, just a wrong answer.

What is OpenTelemetry's role in agentic observability?

OpenTelemetry provides the vendor-neutral wire format and semantic conventions for capturing GenAI spans. LangChain and LangGraph already emit OTel-compatible traces. OTel handles instrumentation and data collection; it doesn't evaluate output quality or fire semantic alerts. Think of it as the foundation: you need it, but you also need an evaluation and alerting layer on top to know whether what the traces contain represents correct behavior.

What should teams monitor for agent quality beyond latency and errors?

The primary quality signals are hallucination rate, tool failure rate, goal abandonment rate, user frustration signals (session abandonment patterns, repeated reformulations), and agent forgetfulness (failure to use context from earlier in a session). For domain-specific agents, custom classifiers for failure modes unique to the use case often matter more than the generic ones. A finance agent may need to track mismatched account codes; a support agent may need to track failed resolution patterns. The failure modes worth monitoring are the ones your users actually experience, which often aren't the ones your eval suite was built to catch.

Try Sentrial

Catch behavioral AI agent failures traditional APM tools miss.

Get started
