AI agent tracing is the practice of recording every step of an agent's execution as a structured, hierarchical log called a trace. That means LLM calls, tool invocations, retrieval queries, state transitions, and decision branches, captured at the session and per-step level so engineers can reconstruct exactly what happened. Without it, a multi-step agent is a black box with a final answer and no explanation.
Most guides stop there. They show you how to get spans into a UI and call it observability. But a trace you can't act on is just expensive logging. The real value of AI agent tracing is the unbroken chain from execution visibility to automated evaluation to source-code-level debugging. This article explains what that chain looks like and why each link matters.
What AI Agent Tracing Actually Captures
AI agent tracing records the complete execution of an agent session as a hierarchy of spans: a root span for the user request, child spans for each agent node (planning, retrieval, tool execution, LLM inference), and events for state transitions. Every span carries structured metadata including per-step latency, token usage and cost per LLM call, tool name with arguments and results, and any error details. Together, these give you a step-by-step reconstruction of the agent's reasoning.
Contrast this with traditional application monitoring. A crash logger tells you the agent died. A trace tells you which planning node chose the wrong tool, what arguments it passed, and how latency compounded across six hops before the user gave up.
The table-stakes fields every trace should contain:
- • Session-level inputs and outputs: the initial user request and final agent response
- • Per-step latency: start and end timestamps for every node
- • Token usage and cost: attributed per LLM call, not just per session
- • Tool call details: tool name, input arguments, returned result, and error code if any
- • Retrieval details: the query, number of results returned, and relevance metadata
- • State transitions: how the agent moved from one node to the next
Braintrust's agent observability guide captures the scope well: observability includes tool selection, tool arguments, model responses, memory reads, memory writes, state transitions, and decision branches. Without all of those, you have partial visibility at best.
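As a minimal sketch of the fields listed above, the record below models one session as a flat list of spans linked by parent ID. The `Span` class, its field names, and the sample values are all hypothetical; they are not the schema of any particular tracing SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative, not a standard."""
    span_id: str
    name: str
    kind: str                      # "session", "llm", "tool", or "retrieval"
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None
    attributes: dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None

    @property
    def latency_ms(self) -> int:
        return self.end_ms - self.start_ms


# One made-up session: a root span plus three child steps.
trace = [
    Span("s0", "session", "session", 0, 2400,
         attributes={"input": "Quote for vendor PDF", "output": "Total: $1,200"}),
    Span("s1", "retrieve_docs", "retrieval", 10, 400, parent_id="s0",
         attributes={"query": "vendor quote line items", "result_count": 3}),
    Span("s2", "extract_pdf", "tool", 410, 1500, parent_id="s0",
         attributes={"arguments": {"path": "quote.pdf"}, "result": None},
         error="EXTRACTION_EMPTY"),
    Span("s3", "draft_answer", "llm", 1510, 2390, parent_id="s0",
         attributes={"prompt_tokens": 1800, "completion_tokens": 120,
                     "cost_usd": 0.0123}),
]

# Step-level questions the session-level output alone cannot answer.
failed_steps = [s.name for s in trace if s.error]
session_cost = sum(s.attributes.get("cost_usd", 0.0) for s in trace)
```

Even in this toy form, the session output looks plausible while `failed_steps` points at the extraction step, which is the kind of gap step-level fields are meant to expose.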
The harder point is what these fields allow you to do later. We'll get to that.
Why Production AI Agents Can't Run Without Tracing
For non-deterministic, multi-step agents, inspecting the final output is essentially useless as a quality signal. If the answer looks correct but a tool call silently failed and the LLM papered over it, you'll never see the problem at the surface level. The failure is invisible until a customer reports it or a downstream system breaks.
Across 12 million logs, we found that roughly 22% of issues were explicit tool call failures, something that visibly stopped the agent. The remaining 78% were silent: hallucinations first, user frustration second, agent forgetfulness or laziness third. Most production agent failures don't produce an error code. The user just gets a wrong answer and leaves.
A finance startup we worked with learned this the hard way. They deployed an agent to automate vendor quotes in week two of launch. The agent ingested a PDF, extracted line items, and computed a price. It "worked" in the sense that it returned different quotes for different documents. But it wasn't reading the PDF correctly. It was hallucinating the quote based on surrounding context and silently ignoring the actual document data. The end-to-end behavior looked fine, so no one flagged it. Without step-level tracing that showed the PDF extraction span returning garbage, this would have gone undetected indefinitely.
LangChain's production monitoring guide puts it directly: traditional APM cannot detect when an agent selects the wrong tool or gets trapped in a reasoning loop. And as Braintrust notes, in production, agent failures often stay invisible until a customer reports the issue.
Three specific failure classes that only step-level tracing can surface:
- • Wrong tool arguments: bad inputs to a tool that produce a syntactically valid but semantically wrong output
- • Latency cascades: one slow retrieval step that compounds across later nodes and pushes the session over SLO
- • Goal abandonment: the user drops off mid-session because the agent looped or stalled, with no error ever firing
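One way to surface the looping behavior behind goal abandonment is to count identical tool calls within a session. The sketch below assumes spans are plain dictionaries with hypothetical `kind`, `tool`, and `arguments` keys, and the threshold of three repeats is an arbitrary illustration, not a recommended value.

```python
import json
from collections import Counter


def find_repeated_tool_calls(spans, threshold=3):
    """Return (tool, arguments) pairs called at least `threshold` times."""
    counts = Counter(
        # Serialize with sorted keys so argument order does not matter.
        (s["tool"], json.dumps(s["arguments"], sort_keys=True))
        for s in spans
        if s["kind"] == "tool"
    )
    return [(tool, json.loads(args))
            for (tool, args), n in counts.items() if n >= threshold]


# Made-up session in which the agent keeps re-issuing the same search.
session = [
    {"kind": "tool", "tool": "search_flights", "arguments": {"from": "SFO", "to": "JFK"}},
    {"kind": "llm", "model": "example-model"},
    {"kind": "tool", "tool": "search_flights", "arguments": {"to": "JFK", "from": "SFO"}},
    {"kind": "tool", "tool": "search_flights", "arguments": {"from": "SFO", "to": "JFK"}},
    {"kind": "tool", "tool": "book_hotel", "arguments": {"city": "NYC"}},
]

loops = find_repeated_tool_calls(session)
```

No error fires anywhere in this session; the repetition is only visible because each tool call was recorded with its arguments.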
This connects to the broader observability picture. We cover the full four-layer model in "LLM Observability Has Four Layers. Most Teams Only Build One." Tracing is the data layer; everything else is built on top.
How Agent Traces Are Structured: Spans, Nodes, and Hierarchies
A well-structured agent trace maps directly onto the execution graph. One root span represents the user session. Under it, child spans represent each agent node: planning, retrieval, LLM inference, tool execution. Under those, events capture state transitions and intermediate signals. The hierarchy isn't cosmetic; it's what lets you isolate a specific node's latency or cost without aggregating across the whole session.
The concrete attributes engineers should capture per span type:
- • LLM inference spans: model version, prompt tokens, completion tokens, total cost, latency, finish reason
- • Tool spans: tool name, input arguments as structured JSON, result or error code, execution time
- • Retrieval spans: query text, number of results, retrieval latency, top-k scores if available
- • Planning/agent node spans: node name, input state, output state, which child spans it spawned
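The hierarchy and the per-type attributes above can be sketched with a small standard-library tracer. This is a toy stand-in for a real SDK such as OpenTelemetry, written only to show how parent-child links arise from nesting; `MiniTracer`, the record layout, and the attribute names are all hypothetical.

```python
import contextvars
import itertools
import time
from contextlib import contextmanager

# Tracks which span is currently open so children can find their parent.
_current_span = contextvars.ContextVar("current_span", default=None)


class MiniTracer:
    """Toy tracer (hypothetical): collects finished spans in a list."""

    def __init__(self):
        self.finished = []
        self._ids = itertools.count(1)

    @contextmanager
    def span(self, name, kind, **attributes):
        parent = _current_span.get()
        record = {
            "span_id": next(self._ids),
            "parent_id": parent["span_id"] if parent else None,
            "name": name,
            "kind": kind,
            "attributes": dict(attributes),
            "error": None,
        }
        token = _current_span.set(record)
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_s"] = time.perf_counter() - start
            _current_span.reset(token)
            self.finished.append(record)


tracer = MiniTracer()
with tracer.span("session", "session", user_input="Find a hotel in Lisbon"):
    with tracer.span("plan", "agent"):
        with tracer.span("search_hotels", "tool", arguments={"city": "Lisbon"}) as tool:
            tool["attributes"]["result_count"] = 4
        with tracer.span("choose_hotel", "llm", model="example-model") as llm:
            llm["attributes"].update(prompt_tokens=950, completion_tokens=60)

parents = {s["name"]: s["parent_id"] for s in tracer.finished}
```

Because the current span lives in a context variable, nesting `with` blocks is enough to record which node spawned which child, with no IDs passed by hand.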
For portability, OpenTelemetry's GenAI semantic conventions define standardized attributes for tracing tasks, actions, agents, teams, artifacts, and memory across complex AI workflows. Using OTel gives you backend portability: the same instrumentation can export to Jaeger, Langfuse, LangSmith, or any OTel-compatible sink without changing your agent code.
If you're using LangChain or LangGraph, the node-based execution graph maps cleanly onto this hierarchy. LangGraph in particular is a useful reference model because its graph structure is explicit: each node is a named unit of execution, which makes span boundaries obvious. Datadog's overview of OTel GenAI conventions describes the goal: a standard schema for tracking prompts, model responses, token usage, tool and agent calls, and provider metadata that makes observability measurable and interoperable.
The setup at Sentrial uses OpenTelemetry for initial logging, with data analysis running on top to drive classification and clustering of problems across runs. The instrumentation itself is five lines of code. The value comes from what happens after the traces land.
Token Cost and Latency Attribution: The Part Most Teams Skip
Every guide mentions token cost as a captured field. Very few explain what to do with it. Per-step cost attribution changes what questions you can answer. Instead of "why did this month's bill spike," you can ask "which node is responsible for 60% of session cost" and get a precise answer.
The operational decision rules that fall out of this data:
- • If a single retrieval step accounts for more than 40% of session latency, the fix is the retriever, not the prompt. Rewriting the system prompt won't speed up a slow vector search.
- • If token cost spikes correlate with a specific planning node, that's a prompt optimization target. You can compress the context window for that node specifically rather than cutting bluntly across the whole agent.
- • If latency for one tool call pattern consistently pushes sessions over your SLO, you can set a node-level alert that fires before the user churns, rather than discovering the problem in next week's retention report.
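The rules above reduce to a small aggregation over per-step spans. The sketch below assumes leaf spans exported as dictionaries with hypothetical `node`, `latency_ms`, and `cost_usd` keys; the 40% threshold is the rule of thumb from the list, and the numbers are invented.

```python
from collections import defaultdict


def attribute_by_node(spans, metric):
    """Return each node's share (0..1) of a metric, summed over its spans."""
    totals = defaultdict(float)
    for s in spans:
        totals[s["node"]] += s.get(metric, 0.0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {node: value / grand_total for node, value in totals.items()}


# Leaf spans from one made-up session (no parent spans, to avoid double counting).
spans = [
    {"node": "plan", "latency_ms": 600, "cost_usd": 0.012},
    {"node": "retrieve", "latency_ms": 2500, "cost_usd": 0.0},
    {"node": "call_tool", "latency_ms": 400, "cost_usd": 0.0},
    {"node": "answer", "latency_ms": 1500, "cost_usd": 0.008},
]

latency_share = attribute_by_node(spans, "latency_ms")
cost_share = attribute_by_node(spans, "cost_usd")

LATENCY_SHARE_THRESHOLD = 0.40  # rule of thumb from the text
slow_nodes = [n for n, share in latency_share.items()
              if share > LATENCY_SHARE_THRESHOLD]
top_cost_node = max(cost_share, key=cost_share.get)
```

In this example the retriever dominates latency while the planning node dominates cost, so the two problems point at two different fixes.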
The Gen Academy's analysis of production LLM failures makes the scale point cleanly: a demo runs ten agent loops a day; production runs ten thousand, and the bill arrives at month-end. Most teams discover this after their first surprise invoice.
Vellum's agent observability guide flags the right tracking targets: token usage and costs, tool success and failure rates, and context quality. The key word is "and." Tracking one of these in isolation tells you something went wrong. Tracking all three tells you where and why.
The teams that get the most from cost and latency attribution are the ones who set node-level SLOs early, before they need them, rather than scrambling to add instrumentation after their first production incident.
What You Can Do With a Trace (Beyond Just Looking at It)
Here's what most guides miss. Viewing spans in a UI is not observability; it's log browsing. The actual value of AI agent tracing is that trace data enables a complete improvement loop: automated evaluation, replay, prompt testing, and code-level debugging. The trace is the raw material for all of it.
Automated evaluations: Once you have per-step trace data, you can run classifiers against every session, not a sample. A classifier looking for hallucinated tool arguments, zero-result retrieval followed by confabulation, or user frustration signals can flag the exact span where the failure occurred. One of our fintech customers, Sailfin Tech, instantiated a custom "mismatched GL codes" classifier on their production logs. You'd think a start-state or end-state check could catch that. It can't, because the agent's output has hundreds of valid variations depending on input. The only way to catch it reliably is span-level classification across every run.
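As an illustration of span-level classification, here is a deliberately simple rule-based check for one of the patterns mentioned above: a retrieval that returns nothing, followed by an LLM step that answers anyway. It is a sketch under assumed span fields (`kind`, `span_id`, `result_count`), not the classifier any particular team or product runs.

```python
def flag_unsupported_answers(spans):
    """Flag LLM spans that run right after a retrieval that returned nothing."""
    flagged = []
    last_retrieval = None
    for s in spans:  # spans are assumed to be in execution order
        if s["kind"] == "retrieval":
            last_retrieval = s
        elif s["kind"] == "llm" and last_retrieval is not None:
            if last_retrieval["result_count"] == 0:
                flagged.append({
                    "span_id": s["span_id"],
                    "reason": "generation after empty retrieval",
                    "retrieval_span_id": last_retrieval["span_id"],
                })
            last_retrieval = None
    return flagged


# Two made-up sessions; both end with a non-empty answer.
sessions = {
    "sess-1": [
        {"span_id": "r1", "kind": "retrieval", "result_count": 5},
        {"span_id": "l1", "kind": "llm"},
    ],
    "sess-2": [
        {"span_id": "r2", "kind": "retrieval", "result_count": 0},
        {"span_id": "l2", "kind": "llm"},
    ],
}

# Classify every session, not a sample.
flags = {sid: flag_unsupported_answers(spans) for sid, spans in sessions.items()}
flagged_sessions = [sid for sid, hits in flags.items() if hits]
```

The output names the exact span to inspect, which an end-state check on the final answer could not do.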
LangChain describes the pattern this way: the fastest teams capture production traces, analyze them to find patterns, build test datasets from real usage, run evaluations to measure quality, and use those results to drive improvements. The key is that each step feeds the next.
Replay and fork: Take a failing session and re-execute from any intermediate span. This means you can test a prompt fix against the exact context that caused the failure, without waiting for that edge case to recur in production. Vellum's guide illustrates this: if a travel agent recommends the wrong hotel, full traces let you see the retrieved listings, tool call results, and the exact prompt that led to the mistake, and then replay from that point.
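A rough sketch of replay, assuming each agent step is a function from state to state and that the trace stored the input state of the failing span. The step functions, `run_from`, and the hotel data are hypothetical and only mirror the travel example above.

```python
import copy


def run_from(steps, start_name, recorded_inputs, overrides=None):
    """Re-run a pipeline from `start_name` using the state recorded in the trace."""
    overrides = overrides or {}
    names = [name for name, _ in steps]
    start = names.index(start_name)
    # Copy so the recorded state stays untouched across replays.
    state = copy.deepcopy(recorded_inputs[start_name])
    for name, fn in steps[start:]:
        state = overrides.get(name, fn)(state)
    return state


def retrieve(state):
    return {**state, "listings": ["Hotel A (2 stars)", "Hotel B (5 stars)"]}


def choose_v1(state):  # original behavior: always picks the first listing
    return {**state, "choice": state["listings"][0]}


def choose_v2(state):  # candidate fix: pick the listing with the most stars
    def stars(item):
        return int(item.split("(")[1][0])
    return {**state, "choice": max(state["listings"], key=stars)}


steps = [("retrieve", retrieve), ("choose", choose_v1)]

# Input state captured on the failing session's "choose" span.
recorded_inputs = {
    "choose": {"query": "best hotel",
               "listings": ["Hotel A (2 stars)", "Hotel B (5 stars)"]},
}

original = run_from(steps, "choose", recorded_inputs)
patched = run_from(steps, "choose", recorded_inputs, overrides={"choose": choose_v2})
```

Replaying from the `choose` step skips retrieval entirely, so the fix is tested against the exact listings that produced the original mistake.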
Prompt A/B testing: Running experiments on live traffic with statistical rigor requires a trace per session as the unit of analysis. Without traces, you're comparing aggregate metrics and guessing what changed.
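With one pass/fail outcome per session, a standard two-proportion z-test is one simple way to compare prompt variants. The sketch below uses only the standard library; the counts are invented, and the choice of this particular test is an assumption for illustration rather than a method the article prescribes.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in success rates between two variants."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: sessions a per-session classifier marked as successful.
z, p_value = two_proportion_z_test(success_a=800, n_a=1000,
                                   success_b=850, n_b=1000)
significant = p_value < 0.05
```

The unit of analysis here is the session, which is only possible when every session has its own trace and its own outcome label.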
Source-code-level debugging: When a classifier flags a failure at a specific span, the next step is knowing which function to fix. That link, from flagged span to code location to suggested fix, is what closes the loop. A Fortune 1000 customer using Sentrial across supply chain, HR, and marketing agents drove their error rate from 20% to under 10% in a single week by following this pattern: trace every step, classify every session, pinpoint the failing code, ship the fix.
Maxim AI's observability overview captures where the field is heading in 2026: unifying tracing, evaluations, and monitoring to build trustworthy AI. Most teams are still stitching together three or four tools to get here. Sentrial covers tracing, automated evaluations, prompt A/B testing, real-time alerting, and source-code-level debugging in one platform, because every handoff between tools is a place where context gets dropped.
The Misconception That Kills Most Tracing Setups
The non-obvious one: teams instrument their agent framework and assume they're done. Framework-level instrumentation, like LangChain's auto-tracing, captures the orchestration layer. Tool implementations, external API calls, and retrieval internals are often opaque unless explicitly spanned. The steps most likely to fail are the ones with the least trace coverage.
The finance company's PDF ingestion failure described earlier is a clean example. The orchestration layer reported success. The tool span, if it had been explicitly instrumented, would have shown malformed extraction output at step one.
Three specific mistakes we see repeatedly in 2026:
Sampling. Traditional APM sampling (trace 1% of requests) is reasonable for web services where failures are statistically representative. AI agent failures are long-tail and session-specific. A hallucination that occurs in one specific user context won't appear in a 1% sample. You have to classify every interaction to catch the failures that actually matter. Running an LLM-as-judge system over sampled logs makes this worse, not better. In our view, LLM-as-judge systems are less accurate than a few targeted evals, especially as agents grow to hundreds of tool calls running for hours.
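To put rough numbers on the sampling argument, assume uniform random sampling in which each session is kept independently with the same probability. Under that assumption (a simplification; real samplers vary), the chance of capturing a rare failure follows directly:

```python
import math


def p_at_least_one_sampled(sample_rate, failing_sessions):
    """Chance that random sampling captures at least one failing session."""
    return 1 - (1 - sample_rate) ** failing_sessions


def failures_needed(sample_rate, confidence):
    """Smallest number of failing sessions needed to sample one with `confidence`."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - sample_rate))


one_off = p_at_least_one_sampled(0.01, 1)        # a failure that happened once
handful = p_at_least_one_sampled(0.01, 5)        # a failure that happened five times
needed_for_95 = failures_needed(0.01, 0.95)      # occurrences before you likely see it
```

At a 1% sample rate, a failure that occurred five times has under a 5% chance of appearing in the sample at all, and it has to recur about 300 times before you are 95% likely to have caught one instance.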
Disabling content recording for privacy. Teams sometimes turn off content capture entirely to protect PII. This guts debuggability. The practical middle ground: keep structured metadata (tool name, error code, latency) always-on in production. Enable content recording in dev and replay contexts. Apply PII redaction to tool arguments and retrieval results at the instrumentation layer rather than turning off tracing altogether.
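A minimal sketch of that middle ground, assuming attributes are built by a hypothetical `build_span_attributes` helper before a span is recorded. The two regular expressions are illustrative only and are nowhere near sufficient for real PII detection.

```python
import re

# Illustrative patterns only; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

ALWAYS_ON = {"tool", "error_code", "latency_ms"}  # structured metadata, no content


def redact(value):
    """Recursively redact strings inside dicts and lists; leave other types alone."""
    if isinstance(value, str):
        return CARD.sub("[CARD]", EMAIL.sub("[EMAIL]", value))
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def build_span_attributes(raw, record_content):
    """Keep metadata always; add redacted content only when recording is enabled."""
    attrs = {k: v for k, v in raw.items() if k in ALWAYS_ON}
    if record_content:
        for key in ("arguments", "result"):
            if key in raw:
                attrs[key] = redact(raw[key])
    return attrs


raw = {
    "tool": "send_invoice",
    "error_code": None,
    "latency_ms": 212,
    "arguments": {"to": "jane.doe@example.com",
                  "note": "card 4111 1111 1111 1111 on file"},
}

prod_attrs = build_span_attributes(raw, record_content=False)
dev_attrs = build_span_attributes(raw, record_content=True)
```

Redacting at the point where attributes are built means raw content never reaches the exporter, while tool name, error code, and latency stay available in every environment.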
Treating end-state checks as sufficient. OnPage's 2026 observability review identifies the pattern: silent failures are situations where an AI application continues running but produces incorrect outputs, and traditional monitoring signals rarely capture them. An end-state check that sees a completed session with a non-null response will return green. The GL code mismatch, the hallucinated quote, the wrong hotel: all green, all wrong.
What we've seen when teams switch from platforms that only show input-LLM-output to step-level tracing: the first thing they notice isn't a new failure type. It's the volume of failures they were already having that they couldn't see.
How to Get Started With AI Agent Tracing
The instrumentation sequence, framework-agnostic:
- 1. Instrument a root span per user session. Attach the session ID, user context (redacted as needed), and the initial input.
- 2. Add child spans for each agent node. Capture the node name, latency (start and end timestamps), and error status.
- 3. Add LLM-specific attributes to inference spans. Model name, prompt tokens, completion tokens, cost, finish reason.
- 4. Add tool-specific attributes to tool spans. Tool name, input arguments as structured JSON, result or error code, execution time.
- 5. Export to a backend that supports hierarchical queries. Confirm parent-child relationships appear correctly before shipping. Broken span context propagation (missing or duplicated trace IDs) is the most common first-day failure and the hardest to debug after the fact.
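The parent-child check in step 5 can be automated against an exported trace. The sketch below assumes each span is a dictionary with hypothetical `trace_id`, `span_id`, and `parent_id` keys; adapt the field names to whatever your backend returns.

```python
def validate_trace(spans):
    """Return a list of structural problems found in one exported trace."""
    problems = []

    seen = set()
    for s in spans:
        if s["span_id"] in seen:
            problems.append(f"duplicate span_id: {s['span_id']}")
        seen.add(s["span_id"])

    trace_ids = {s["trace_id"] for s in spans}
    if len(trace_ids) > 1:
        problems.append(f"multiple trace_ids: {sorted(trace_ids)}")

    roots = [s for s in spans if s["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected 1 root span, found {len(roots)}")

    for s in spans:
        if s["parent_id"] is not None and s["parent_id"] not in seen:
            problems.append(
                f"orphan span {s['span_id']}: parent {s['parent_id']} missing")
    return problems


good = [
    {"trace_id": "t1", "span_id": "root", "parent_id": None},
    {"trace_id": "t1", "span_id": "plan", "parent_id": "root"},
    {"trace_id": "t1", "span_id": "tool", "parent_id": "plan"},
]
# Typical symptom of lost context: a child lands in a new trace with no parent.
broken = [
    {"trace_id": "t1", "span_id": "root", "parent_id": None},
    {"trace_id": "t2", "span_id": "tool", "parent_id": "zzz"},
]

good_problems = validate_trace(good)
broken_problems = validate_trace(broken)
```

Running a check like this in a test or a pre-release script catches broken propagation while the instrumentation change is still fresh.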
For stack choices:
- • OpenTelemetry + Jaeger: self-hosted, vendor-neutral, good for teams that want full control. Requires more setup but gives you portability.
- • LangSmith or Langfuse: managed UI for LangChain and LangGraph teams who need visibility quickly. Good for getting traces into a browser fast. Limited when you need evaluations or production alerting layered on top.
- • Sentrial: five lines of instrumentation via OpenTelemetry, LangChain, LangGraph, or custom Python. Tracing plus automated evaluations, prompt A/B testing, real-time Slack alerts, and source-code-level debugging. The right choice when you need the full loop, not just span storage.
LangChain's production monitoring guide frames the requirement well: agent observability requires monitoring the inputs and outputs themselves, not just system metrics around them. When agents are having multi-turn conversations with users, the primary signal lives in the conversations.
Once tracing is working, automated evaluations are the next step. We cover that in "Your Agent Passing Evals Means Nothing," including why pre-deployment evals miss the failures that actually appear in production and what to run instead.
FAQ
What is AI agent tracing?
AI agent tracing is the practice of recording every step of an agent's execution as a structured, hierarchical log. Each step, including LLM calls, tool invocations, retrieval queries, and state transitions, becomes a span with metadata for latency, token usage, cost, and error status. The result is a complete reconstruction of what the agent did during a session, at the step level rather than just the input-output level.
What does agent tracing enable you to do in production?
Tracing is the data layer that makes everything else possible. With per-step traces, you can run automated classifiers to flag hallucinations, bad tool calls, and user frustration across every session. You can replay a failing session from any intermediate step to test a fix against the exact context that caused the failure. You can do prompt A/B testing with statistical rigor using real sessions as the unit of analysis. And you can pinpoint which function in your codebase produced the failure, not just which session it appeared in.
What should you instrument: LLM calls, tool calls, retrieval, or decision flow?
All of them, explicitly. Framework auto-instrumentation covers the orchestration layer, but tool implementations and retrieval internals are often black boxes unless you add spans manually. The steps most likely to fail are the ones with the least default coverage. For each span type: LLM spans need model, tokens, cost, and finish reason; tool spans need tool name, input arguments, result or error code, and execution time; retrieval spans need query text, result count, and latency.
How is AI agent tracing related to OpenTelemetry?
OpenTelemetry is the vendor-neutral standard for emitting spans. Its GenAI semantic conventions define standardized attributes for AI workloads: prompts, model responses, token usage, tool and agent calls, and provider metadata. Using OTel gives you portability: the same instrumentation exports to any OTel-compatible backend. Many frameworks like LangChain already emit OTel spans; the risk is double-instrumentation if you add your own spans on top without checking what the framework is already emitting.
Does agent tracing capture sensitive data?
It can, which is why many teams disable content recording entirely and hurt their own debuggability. The better approach is to keep structured metadata (tool name, error code, latency, token count) always-on in production, apply PII redaction at the instrumentation layer to tool arguments and retrieval results, and enable full content recording in dev and replay contexts where you need to diagnose specific failures. Turning off tracing to protect PII leaves you flying blind on the inputs that caused the failure.