AI Agent Tracing Explained: Spans, Evals & Debugging

Most engineers who set up AI agent tracing think they're done once spans appear in a UI. They're not. The hard part isn't capturing a trace; it's turning that trace into something you can act on before a user churns or a calculation goes wrong. We've seen this gap play out across enough production deployments that it's worth walking through the whole picture: what tracing actually captures, how it's structured, where most setups quietly break, and what a complete observability loop looks like once agents are in the wild.

Tracing an agent is not the same as understanding it

Agent observability is the practice of capturing every step an AI agent takes during execution, including tool selection, tool arguments, model responses, memory reads, memory writes, state transitions, and decision branches.

That definition sounds straightforward. The operational difference it makes is not. A traditional crash logger tells you the agent died. A trace tells you which planning node chose the wrong tool, what arguments it passed, and how latency compounded across six steps before the user gave up. Those are not the same diagnostic.

The table-stakes data types every trace should contain are:

• Session-level inputs and outputs
• Per-step latency with start and end timestamps at each node
• Token usage and cost attribution per LLM call
• Tool call outcomes: tool name, arguments, result, and error details
• Retrieval query plus result count
• State transitions between agent nodes
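
To make the list above concrete, here is a minimal sketch of one way to hold those fields in memory. The class and field names (StepRecord, SessionTrace, and so on) are our own illustrative choices, not a standard schema or any vendor's format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StepRecord:
    """One step (node) in an agent run. All field names are illustrative."""
    node: str                                # e.g. "planner", "tool:pdf_ingest"
    start_ts: float                          # seconds since session start
    end_ts: float
    prompt_tokens: int = 0                   # stays 0 for non-LLM steps
    completion_tokens: int = 0
    cost_usd: float = 0.0
    tool_name: Optional[str] = None
    tool_args: Optional[dict[str, Any]] = None
    tool_result: Any = None
    error: Optional[str] = None
    retrieval_query: Optional[str] = None
    retrieval_result_count: Optional[int] = None

    @property
    def latency_s(self) -> float:
        return self.end_ts - self.start_ts


@dataclass
class SessionTrace:
    session_id: str
    user_input: str
    final_output: str
    steps: list[StepRecord] = field(default_factory=list)

    def transitions(self) -> list[tuple[str, str]]:
        """State transitions as (from_node, to_node) pairs in execution order."""
        names = [s.node for s in self.steps]
        return list(zip(names, names[1:]))


# Hypothetical session in which the ingestion tool fails but the run continues.
trace = SessionTrace("s-1", "quote this PDF", "Quote: $1,200", steps=[
    StepRecord("planner", 0.0, 0.4, prompt_tokens=120, completion_tokens=30, cost_usd=0.002),
    StepRecord("tool:pdf_ingest", 0.4, 1.9, tool_name="pdf_ingest",
               tool_args={"path": "rfp.pdf"}, error="empty text layer"),
    StepRecord("llm:quote", 1.9, 3.1, prompt_tokens=900, completion_tokens=80, cost_usd=0.01),
])
print(trace.transitions())
print([s.node for s in trace.steps if s.error])
```

The final output of this session looks fine on its own; only the per-step error field shows that the ingestion step failed.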

The non-determinism of agents is what makes this hard. When tool calls and branching behavior vary across runs, trace-and-end-state-only approaches stop being sufficient. You need the semantic and behavioral signals that show up even when the run doesn't follow the same path twice.

Silent failures are the norm, not the exception

Final-output inspection fails silently for multi-step agents. If an answer looks correct but a tool call failed internally and the LLM papered over it with a confident response, you'll never know without step-level visibility.

Traditional APM tools can't detect when an agent selects the wrong tool or gets trapped in a reasoning loop. They can confirm a 200 status code came back, but they can't show that the agent looped twice, called the wrong tool, or hallucinated a billing policy. In production, agent failures often stay invisible until a customer reports the issue.
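
As a rough illustration of what step-level data makes possible here, the sketch below counts identical tool calls within one run to surface a possible reasoning loop. The step dictionaries and their keys ("tool", "args") are assumptions made for this example, and a repeated call is only a hint worth inspecting, not proof of a loop.

```python
import json
from collections import Counter


def find_repeated_tool_calls(steps, threshold=2):
    """Return (tool, args_json, count) for identical calls seen >= threshold times."""
    counts = Counter(
        (s["tool"], json.dumps(s["args"], sort_keys=True))
        for s in steps
        if s.get("tool")                     # skip steps that are not tool calls
    )
    return [(tool, args, n) for (tool, args), n in counts.items() if n >= threshold]


# Hypothetical run: the agent issues the same search twice.
steps = [
    {"node": "plan"},
    {"node": "tool", "tool": "search", "args": {"q": "refund policy"}},
    {"node": "tool", "tool": "search", "args": {"q": "refund policy"}},
    {"node": "tool", "tool": "lookup_order", "args": {"id": 7}},
]
repeats = find_repeated_tool_calls(steps)
print(repeats)
```

An HTTP status code cannot reveal this pattern, because every one of those calls may have returned successfully.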

A finance startup we worked with discovered this the hard way. Their vendor agent would take in a PDF, extract data, and compute a quote. It launched in week two and the outputs looked reasonable: different quotes for different PDFs, in approximately the right price range. But the PDF ingestion step was broken. The agent was never actually reading the document. It was hallucinating the quote based on surrounding context from the RFP and customer metadata, not the actual PDF contents.

This went undetected because the agent succeeded end-to-end from a surface perspective. Without step-level tracing and evaluations on every interaction, they had no way to see that the ingestion node was silently failing. Had they not caught it, they estimated it wouldn't have surfaced for a very long time.

This failure class isn't rare. Across 12 million logs we've analyzed, approximately 22% of issues were explicit tool call failures, something that made the agent stop. The remaining 78% were silent: hallucinations, user frustration, agent forgetfulness or laziness. The majority of production agent failures never throw an error. The user just gets a wrong or useless answer and leaves.

That's exactly why tracing is the prerequisite layer for everything else. Once you have per-step trace data, you can build automated quality signals on top of it. Without it, you're flying blind. For a deeper look at what sits above the trace layer, our article on LLM observability covers the full four-layer stack.

Span hierarchies need to match your execution graph

The span hierarchy in an agent trace maps directly onto the agent's execution graph. The standard structure is one root span per user session, with child spans for each agent node (planning, retrieval, tool execution, LLM inference) and events capturing state transitions between nodes.

For each span type, the concrete attributes you need to capture differ:

• LLM inference spans: model name and version, prompt tokens, completion tokens, cost per call, latency
• Tool call spans: tool name, input arguments, result payload, error code if applicable
• Retrieval spans: query text, number of results returned, latency
• Planning/orchestration spans: node name, decision branch taken, latency
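
The hierarchy and per-type attributes above can be sketched without any tracing library. The Span class below is a hypothetical stand-in for illustration, not the OpenTelemetry API, and the attribute names are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    name: str
    kind: str                                # "session", "planning", "retrieval", "tool", "llm"
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)

    def child(self, name: str, kind: str, **attributes: Any) -> Span:
        """Create a child span and attach it to this span."""
        span = Span(name, kind, attributes)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + f"{span.kind}:{span.name}"]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


# One root span per session, child spans per node, attributes per span type.
root = Span("session-42", "session")
plan = root.child("plan", "planning", branch="lookup_invoice", latency_ms=310)
plan.child("search_docs", "retrieval", query="invoice 1187", result_count=3, latency_ms=95)
plan.child("get_invoice", "tool", arguments={"invoice_id": 1187}, error_code=None)
root.child("answer", "llm", model="example-model", prompt_tokens=812,
           completion_tokens=96, cost_usd=0.004, latency_ms=1400)

print("\n".join(render(root)))
```

If the rendered tree does not resemble your agent's execution graph, the instrumentation is missing a node or attaching spans to the wrong parent.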

OpenTelemetry semantic conventions define standard attributes for tracing tasks, actions, agents, teams, artifacts, and memory across complex AI workflows. Using OTel's gen-AI semantic conventions gives you portability across backends such as Jaeger, Langfuse, and LangSmith, and means you're not locked into any single vendor's trace format. This standard schema covers prompts, model responses, token usage, tool calls, and provider metadata, making AI observability measurable and interoperable across frameworks.

One practical note for teams using LangGraph specifically: its node-based execution graph maps cleanly to the root → node → tool/LLM/retrieval hierarchy, which makes it a useful reference model even if you're building on a different framework. When LangChain or LangGraph already emit spans automatically, watch for double-instrumentation. Framework auto-tracing covers the orchestration layer, so add explicit spans only where it leaves gaps, which is usually tool implementations and external API calls.

In our setup at Sentrial, we use OpenTelemetry for initial logging, then run data analysis on top of it to identify issues and cluster problems across runs. Dropping in five lines of instrumentation code gets the data flowing. The value comes from what you do with it after.

Cost and latency attribution matter at the node level, not the session level

Most guides list token cost as a field traces should capture. Fewer explain how to use it once you have it.

Per-step cost attribution lets you identify which specific node, not which session, is running over budget. If a single retrieval step accounts for more than 40% of session latency, the fix is the retriever, not the prompt. If token cost spikes correlate to a specific planning node, that's a prompt optimization target, not a reason to switch models everywhere. Blunt cost cuts that reduce model quality across the board are what happens when teams only have session-level totals.
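
A minimal sketch of that attribution, assuming each span is a plain dictionary with "node", "latency_s", and "cost_usd" keys (names chosen for this example) and that steps run sequentially, so summed step latency approximates session latency:

```python
from collections import defaultdict


def attribute_by_node(spans):
    """Sum latency and cost per node and compute each node's share of session latency."""
    totals = defaultdict(lambda: {"latency_s": 0.0, "cost_usd": 0.0})
    for s in spans:
        totals[s["node"]]["latency_s"] += s["latency_s"]
        totals[s["node"]]["cost_usd"] += s.get("cost_usd", 0.0)
    session_latency = sum(t["latency_s"] for t in totals.values())
    return {
        node: {**t, "latency_share": t["latency_s"] / session_latency if session_latency else 0.0}
        for node, t in totals.items()
    }


def over_budget(report, max_share=0.40):
    """Nodes whose share of session latency exceeds max_share."""
    return sorted(n for n, t in report.items() if t["latency_share"] > max_share)


# Hypothetical spans from one session.
spans = [
    {"node": "planner", "latency_s": 0.5, "cost_usd": 0.002},
    {"node": "retriever", "latency_s": 3.0},
    {"node": "llm", "latency_s": 1.5, "cost_usd": 0.010},
]
report = attribute_by_node(spans)
print(over_budget(report))
```

In this made-up session the retriever accounts for 3.0 of 5.0 seconds, so it is the only node over the 40% threshold and the prompt is not the place to look.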

A demo runs ten agent loops a day; production runs ten thousand, and the bill arrives at month-end. That's the moment teams realize session-level attribution isn't enough. Node-level attribution is what turns that surprise invoice into an actionable engineering task.

The same logic applies to latency. A trace that records per-step timestamps lets you set node-level SLOs and fire alerts the moment a specific tool call pattern starts pushing overall sessions over threshold, before users churn rather than after. Teams should track token usage and costs, tool success and failure rates, and context quality at each step, not just for the session as a whole.
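
One way to express a node-level SLO check is sketched below. The choice of p95, the nearest-rank method, and the threshold values are our assumptions for illustration; a real alerting pipeline would compute this over a rolling window.

```python
def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def slo_breaches(latencies_by_node, slo_by_node):
    """Return {node: observed_p95} for every node whose p95 exceeds its SLO."""
    breaches = {}
    for node, values in latencies_by_node.items():
        limit = slo_by_node.get(node)
        if limit is None or not values:
            continue                          # no SLO defined or no data yet
        observed = p95(values)
        if observed > limit:
            breaches[node] = observed
    return breaches


# Hypothetical per-step latencies (seconds) collected from recent traces.
latencies = {
    "planner": [0.3] * 20,
    "retriever": [0.2] * 18 + [4.0, 4.5],
}
slos = {"planner": 1.0, "retriever": 1.0}
breaches = slo_breaches(latencies, slos)
print(breaches)
```

Only the retriever breaches here, which points the alert at a specific node rather than at the session as a whole.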

We've seen this operationally: you can't monitor every single log at the session level and catch granular cost anomalies. Full coverage, per node, is what gives you actionable signals rather than aggregate graphs that move after damage is already done.

Viewing spans in a UI is not the same as having observability

This is where most tracing guides stop too early. Viewing spans in a UI is not the goal. The trace is the data layer that enables a complete improvement loop, and stopping at visualization means you've built expensive logging, not observability.

The full loop looks like this: traces feed automated evaluations, evaluations surface specific failure patterns, those patterns drive prompt A/B tests with statistical rigor, and replay lets you test fixes against the exact context that caused the failure without waiting for that edge case to recur in production. The fastest teams capture production traces, analyze them to find patterns, build test datasets from real usage, and run evaluations to measure quality.

Automated evaluations at scale. Once you have per-step trace data, you can run classifiers against every session, not a sample, to flag the specific span where the agent hallucinated a tool argument or where a retrieval step returned zero results and the agent confabulated an answer. One of our fintech customers, Sailfin Tech, instantiated a custom classifier for "mismatched GL codes" on their production logs. This wasn't something they could track with a start check or end-state check: even the end states had hundreds of variations. If a GL code wasn't in the system, the agent might not output a product with that code at all. Step-level traces made the classifier possible.
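
As a simplified illustration of a per-session check, the sketch below flags runs where a retrieval step returned nothing and the agent still produced an answer. It is a rule-based stand-in written for this article, not the classifier described above, and the session and step keys are assumed.

```python
def flag_confabulation_risk(session):
    """Flag empty retrieval steps in sessions that still produced a final answer."""
    answered = bool(session.get("final_output", "").strip())
    if not answered:
        return []
    return [
        {"step_index": i, "node": step["node"], "reason": "empty retrieval"}
        for i, step in enumerate(session["steps"])
        if step["kind"] == "retrieval" and step.get("result_count", 0) == 0
    ]


# Hypothetical sessions; run the check on every one of them, not a sample.
sessions = [
    {"id": "a", "final_output": "Your plan renews on May 1.", "steps": [
        {"node": "search_docs", "kind": "retrieval", "result_count": 0},
        {"node": "answer", "kind": "llm"},
    ]},
    {"id": "b", "final_output": "Invoice 1187 is paid.", "steps": [
        {"node": "search_docs", "kind": "retrieval", "result_count": 4},
        {"node": "answer", "kind": "llm"},
    ]},
]
flagged = {s["id"]: flag_confabulation_risk(s) for s in sessions}
print(flagged)
```

Session "a" answered confidently after retrieving nothing, which is exactly the kind of run that looks healthy from the final output alone.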

Replay and fork. Full trace visibility means you can replay a full run step by step. If an agent recommends the wrong output, traces let you see the retrieved context, tool call results, and the exact prompt that led to the mistake. Being able to fork from any intermediate step means you can test a prompt change against the precise failing context rather than synthetically reconstructing it.
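
The idea of forking can be sketched as follows: reuse the recorded outputs for every step before the fork point, then re-execute from that point with a changed implementation. The record layout and the rerun_step callback are hypothetical; in a real system the callback would call the model or tool with the revised prompt.

```python
def fork(recorded_steps, at, rerun_step):
    """Reuse recorded outputs for steps before `at`, then re-execute from `at` onward."""
    replayed = list(recorded_steps[:at])
    context = [s["output"] for s in replayed]    # exact context seen in production
    for step in recorded_steps[at:]:
        output = rerun_step(step["node"], step["input"], context)
        replayed.append({**step, "output": output})
        context.append(output)
    return replayed


# Hypothetical recorded run whose final step produced a bad answer.
recorded = [
    {"node": "retrieve", "input": "refund policy?", "output": "doc: refunds within 30 days"},
    {"node": "answer", "input": "refund policy?", "output": "Refunds are not available."},
]


def patched_step(node, step_input, context):
    # Stand-in for re-running the node with a revised prompt.
    return f"{node} grounded in [{context[-1]}]"


result = fork(recorded, at=1, rerun_step=patched_step)
print(result[1]["output"])
```

The retrieval output is taken from the recording rather than re-fetched, so the fix is tested against the same context that produced the original mistake.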

Prompt A/B testing with real signal. Most prompt testing happens offline and doesn't reflect how changes land on the actual distribution of production inputs. Once tracing is running, you can run experiments on live traffic and measure impact on the metrics that matter (frustration rates, refusals, accuracy, agent forgetfulness), not just perplexity or offline benchmark scores. We cover this in detail in our article on prompt A/B testing.
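
For a rate metric such as frustrated sessions per variant, one common way to check whether a difference is more than noise is a two-proportion z-test. The sketch below uses only the standard library; the counts are invented, and the choice of test is ours rather than a description of any particular platform.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variance: both variants have identical, degenerate rates")
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: frustrated sessions out of total sessions per prompt variant.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z={z:.2f}, p={p_value:.3f}")
```

With these made-up numbers the p-value falls below 0.05, but the same three-point gap on a few hundred sessions per arm would not, which is why sample size belongs in the readout alongside the rates.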

AI agent observability in production is about unifying tracing, evaluations, and monitoring to build trustworthy AI. Most teams stitch together three or four tools to get there. At Sentrial, this full loop (tracing, automated evaluations, A/B testing, real-time Slack alerts, and source-code-level failure pinpointing) is what the platform is built around, because the gap between "we can see spans" and "we know what to fix and where in the code" is where production failures compound.

Framework instrumentation covers less than you think

Here's the one that catches experienced teams: instrumenting your agent framework and assuming you're done.

Framework-level instrumentation (LangChain auto-tracing, LangGraph callbacks) covers the orchestration layer. It tells you what the framework called and in what order. Tool implementations, external API calls, and retrieval internals are often opaque unless you explicitly span them. The steps most likely to produce silent failures are precisely the ones with the least automatic trace coverage. Operational maturity requires span-level usage metrics; teams must track exactly which text chunks the model referenced in its final logic, not just what was loaded into memory.

The second misconception is sampling. Traditional APM sampling (trace 1% of requests) is reasonable for web services where failures are statistically distributed. Agent failures are not. A hallucination triggered by a specific combination of user context and memory state won't show up in a 1% sample. We've been consistent about this from the start: you have to monitor every single log, not a percentage of them. If you sample or only track a slice, you miss the issues that show up in production, and those are exactly what teams care about once the agent is actually used. One of the biggest challenges in production AI systems is detecting silent failures, situations where an AI application continues running but produces incorrect or misleading outputs; traditional monitoring signals such as latency and error rates rarely capture these.
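
The arithmetic behind the sampling point is short. Assuming each session is sampled independently at a fixed rate (a simplification), the chance that a sample contains none of the sessions showing a rare failure is (1 - rate) raised to the number of affected sessions:

```python
def prob_sample_misses_all(failing_sessions, sample_rate):
    """Probability that independent sampling captures none of the failing sessions."""
    return (1 - sample_rate) ** failing_sessions


# Hypothetical case: a failure mode that occurred in 50 sessions, sampled at 1%.
miss = prob_sample_misses_all(50, 0.01)
print(f"chance the sample contains zero examples: {miss:.0%}")
```

In this example the sample is more likely than not to contain no instance of the failure at all, and a single captured instance is rarely enough to recognize a pattern.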

A third misconception worth addressing is the privacy overcorrection. Teams sometimes disable content recording entirely to avoid capturing PII, which guts debuggability. The practical answer is more surgical: keep structured metadata (tool name, error code, latency) always on in production; apply PII redaction to tool arguments and retrieval results; and enable full content recording in development and replay contexts where you need it to debug. Turning tracing off entirely to solve a privacy concern trades one risk for another.
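
A rough sketch of that policy follows. The attribute keys and the full_content flag are illustrative, and the single email regex is a placeholder: real redaction needs broader detectors for names, account numbers, addresses, and so on.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
ALWAYS_KEEP = {"tool_name", "error_code", "latency_ms"}   # structured metadata


def redact(value):
    """Recursively replace email addresses in strings, dicts, and lists."""
    if isinstance(value, str):
        return EMAIL.sub("[REDACTED_EMAIL]", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def export_attributes(attrs, full_content):
    """Metadata passes through; content is redacted unless full_content is True."""
    return {
        key: value if (key in ALWAYS_KEEP or full_content) else redact(value)
        for key, value in attrs.items()
    }


# Hypothetical tool span attributes.
attrs = {
    "tool_name": "send_email",
    "latency_ms": 120,
    "error_code": None,
    "arguments": {"to": "jane@example.com", "body": "Your quote is attached."},
}
production = export_attributes(attrs, full_content=False)
development = export_attributes(attrs, full_content=True)
print(production["arguments"])
```

The production export still carries the tool name, latency, and error code, so failure rates and latency alerts keep working while the argument payload is scrubbed.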

The pattern we've seen most often when teams switch to Sentrial is that they were using a tool that showed logs, mainly end-to-end, without the tool calls in between, and they found they could either build on top of it for their specific use case or accept surface-level visibility. Once agents entered the picture, that surface-level visibility stopped being sufficient. What teams needed was semantic analysis that surfaces the failures traces alone can't catch.

Getting tracing working in production

The setup sequence that works regardless of framework:

1. Instrument a root span per user session. This is your trace anchor. Every subsequent span should be a child or descendant of this root.
2. Add child spans for each agent node with latency (start/end timestamps) and a pass/fail status.
3. Add LLM-specific attributes to inference spans: model name, prompt tokens, completion tokens, cost per call.
4. Add tool-specific attributes to tool spans: tool name, input arguments, result payload, error code.
5. Export to a backend that lets you query and visualize the parent-child hierarchy, not just flat log lines.
6. Verify span context propagation before shipping. The most common first-day failure is broken trace IDs: spans appear in the UI but the parent-child relationships are wrong, and the hierarchy that makes debugging useful is absent.
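
Step 6 can be checked mechanically on exported spans. The sketch below assumes each span is a dictionary with "trace_id", "span_id", and "parent_id" keys, which mirrors the usual trace data model, though the exact field names depend on your backend's export format.

```python
def check_hierarchy(spans):
    """Return a list of human-readable problems; an empty list means the tree is intact."""
    problems = []
    span_ids = {s["span_id"] for s in spans}
    trace_ids = {s["trace_id"] for s in spans}
    if len(trace_ids) > 1:
        problems.append(f"multiple trace ids in one session: {sorted(trace_ids)}")
    roots = [s for s in spans if s["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected 1 root span, found {len(roots)}")
    for s in spans:
        if s["parent_id"] is not None and s["parent_id"] not in span_ids:
            problems.append(f"orphan span {s['span_id']}: parent {s['parent_id']} not found")
    return problems


# Hypothetical exports: one healthy session and one where context was lost in a tool call.
healthy = [
    {"trace_id": "t1", "span_id": "root", "parent_id": None},
    {"trace_id": "t1", "span_id": "plan", "parent_id": "root"},
    {"trace_id": "t1", "span_id": "tool", "parent_id": "plan"},
]
broken = [
    {"trace_id": "t1", "span_id": "root", "parent_id": None},
    {"trace_id": "t2", "span_id": "tool", "parent_id": "plan"},
]
print(check_hierarchy(healthy))
print(check_hierarchy(broken))
```

Running a check like this in a pre-release test catches the case where spans show up in the UI but are detached from the session they belong to.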

For stack selection, the main options each serve different needs:

• OpenTelemetry plus Jaeger for teams who want a self-hosted, vendor-neutral setup with full control over the data pipeline
• LangSmith or Langfuse for LangChain and LangGraph teams who want a managed UI quickly, with callbacks as the standard integration mechanism
• Sentrial for teams who need tracing plus automated evaluations, production alerting, and source-code-level debugging in one platform, rather than connecting separate tools for each part of the loop. Setup is self-serve: five lines of instrumentation for whatever agent file you're working with, OpenTelemetry underneath, and classification running on top of it

Agent observability requires monitoring the inputs and outputs themselves, not just the system metrics around them. Robust monitoring for multi-step, tool-augmented, non-deterministic systems requires granular distributed tracing, automated evaluations, real-time alerts, and data curation; traditional APM is insufficient.

Once tracing is working and you've confirmed correct span hierarchies, the next step is automated evaluations: defining the failure modes specific to your agent and running classifiers against full production traffic. Our article on evals covers what offline evals miss and why production-time classification is the gap most teams hit next.

Frequently Asked Questions

What is AI agent tracing?

AI agent tracing is the practice of recording every step of an agent's execution (LLM calls, tool invocations, retrieval queries, state transitions, and decision branches) as a structured, hierarchical log called a trace. The goal is to give engineers the ability to reconstruct exactly what the agent did during any session: which spans succeeded or failed, what inputs each step received, and how latency and cost accumulated across the run.

What does agent tracing enable you to do in production?

Tracing is the data layer that makes everything else possible. With per-step trace data, you can run automated evaluations against every session to flag specific failure spans, set node-level SLOs and fire alerts when latency patterns push sessions over threshold, replay and fork from any intermediate step to test prompt fixes against the exact failing context, and run statistically rigorous A/B tests on prompt changes using real production traffic. Without traces, you're limited to surface-level metrics that don't distinguish between correct and incorrect answers.

What should you instrument: LLM calls, tool calls, retrieval, or decision flow?

All four. LLM inference spans should capture model, token counts, and cost. Tool call spans should capture tool name, input arguments, result payload, and error code. Retrieval spans should capture the query and result count. Planning and decision nodes should capture which branch was taken and latency. Framework auto-instrumentation typically covers the orchestration layer but leaves tool implementations and external API calls as black boxes; those need explicit spans added.

How is AI agent tracing related to OpenTelemetry?

OpenTelemetry provides the vendor-neutral standard for emitting and exporting trace data. Its gen-AI semantic conventions define a standard schema for prompts, model responses, token usage, tool calls, and provider metadata, making traces portable across backends like Jaeger, Langfuse, and LangSmith. Using OTel means you're not locked into a single vendor's instrumentation format and that your trace data can be consumed by multiple tools without re-instrumentation.

Does agent tracing capture sensitive data?

It can, and that requires a deliberate policy rather than disabling tracing entirely. The practical approach is to keep structured metadata (tool names, error codes, latency, token counts) always on in production, apply PII redaction to tool arguments and retrieval results that may contain user data, and enable full content recording in development and replay environments where you need it to debug. Turning tracing off to solve a privacy concern removes the visibility needed to catch silent failures that may themselves represent compliance or accuracy risks.
