Your Agent Passing Evals Means Nothing

Evals score AI outputs against known criteria, but 78% of agent failures have no ground truth. Here's what evals catch and where they go blind.


Neel Sharma

May 28, 2026 · 10 min read

Evals are repeatable scoring processes that measure AI model or agent outputs against defined criteria. They use test datasets, graders, and expected outcomes to catch regressions before or after deployment. In AI contexts, the term covers two distinct workflows: offline evals (run in CI against curated datasets before a change ships) and online evals (run against live production traffic). That distinction shapes every decision that follows.

What Evals Are, Exactly

Evals are a structured mechanism for answering one question: did this agent output meet the standard we defined? You provide inputs, you define what "correct" looks like, and a grader scores the output against that definition. The result is a signal: this change made things better, worse, or didn't move the needle.

The term gets used loosely. In the AI context here, we're talking about evaluation frameworks for LLMs and agents, not performance reviews or school assessments. The two modes to hold in mind from the start: offline evals run before deployment on golden datasets you control; online evals run in production on real user sessions where there is often no predetermined ground truth. Most of the hard problems live in that second category.

Why Evals Matter More Than Monitoring Alone

Traditional monitoring tells you when your agent crashed or timed out. It does not tell you when your agent confidently gave a user wrong information, forgot context from three turns ago, or hallucinated a product feature that doesn't exist. No error fires. No alert triggers. The user just gets a wrong answer and leaves.

Across 12 million logs we've analyzed at Sentrial, around 22% of agent issues were explicit tool call failures, meaning something that made the agent stop. The remaining 78% were silent: hallucinations ranked first, user frustration second, agent forgetfulness or laziness third. The majority of agent failures don't crash the run. They show up as incorrect or unhelpful answers that users encounter and never report. Evals are the mechanism designed to surface these outcomes before they compound.

Beam.ai research found that only 5% of AI agents that reach production have mature monitoring, and 47% of organizations using generative AI have experienced hallucinated outputs. Traditional APM tracks uptime; production AI demands simultaneous tracking of relevance, coherence, factual accuracy, and user satisfaction. Evals are how you operationalize those dimensions.

For teams thinking about regression testing specifically, our piece on AI agent regression testing covers that workflow in depth. Here, the focus is on what evals are and where they fall short.

The Building Blocks: Datasets, Graders, and Traces

Every eval system, regardless of tooling, is built from three core artifacts.

Datasets are collections of test inputs and, optionally, expected outputs. A golden dataset might be 200 past support conversations with human labels: "this response was correct," "this one hallucinated the return policy." In practice, these are usually stored as JSONL files with one example per line. The quality of your dataset determines the quality of your eval signal. Garbage in, garbage out is especially true here.
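As a rough illustration of the JSONL layout described above, the sketch below parses one example per line into a small record. The field names (`input`, `output`, `label`) and the two sample conversations are assumptions made for this example, not a required schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    input: str             # what the user asked
    output: str            # what the agent replied
    label: Optional[str]   # human label, e.g. "correct" or "hallucination"


def load_jsonl(text: str) -> list[Example]:
    """Parse JSONL text: one JSON object per non-empty line."""
    examples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON") from exc
        examples.append(Example(row["input"], row["output"], row.get("label")))
    return examples


# Two hypothetical labeled support conversations.
SAMPLE = """
{"input": "Can I return shoes after 40 days?", "output": "Yes, within 90 days.", "label": "hallucination"}
{"input": "Where is order 1182?", "output": "Order 1182 shipped on Monday.", "label": "correct"}
"""

dataset = load_jsonl(SAMPLE)
print(len(dataset), [ex.label for ex in dataset])
```

Failing loudly on a malformed line is a deliberate choice here: a silently skipped row shrinks the golden set without anyone noticing.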

Graders are the scoring functions. A grader takes an agent output and returns a score or label: pass/fail, a 0-1 confidence score, or a category like "hallucination," "incomplete," or "off-topic." Graders can be deterministic (does the response include the required order ID?) or model-based (does this response accurately reflect the information in the retrieved context?). The choice of grader type matters enormously, which is why it gets its own section below.
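A deterministic grader for the order-ID check mentioned above might look like the following sketch. The `ORD-` plus six digits format, the `Grade` record, and the `wrong_order_id` label are hypothetical; substitute whatever your system actually uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Grade:
    passed: bool
    label: str  # "ok" or a failure category


# Hypothetical ID format: "ORD-" followed by six digits.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def grade_order_id(response: str, expected_order_id: str) -> Grade:
    """Pass only if the response mentions the expected order ID."""
    found = ORDER_ID.findall(response)
    if not found:
        return Grade(False, "incomplete")       # no order ID at all
    if expected_order_id not in found:
        return Grade(False, "wrong_order_id")   # an ID, but not the right one
    return Grade(True, "ok")


print(grade_order_id("Your order ORD-123456 ships today.", "ORD-123456"))
print(grade_order_id("Your order ships today.", "ORD-123456"))
```

Returning a category alongside pass/fail keeps the failures countable by type later, which a bare boolean does not.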

Traces are the raw logs of actual agent runs: the user message, every tool call, the intermediate reasoning steps, the final response. In offline evals, you're generating these from a test set. In online evals, production traces become your eval substrate. A trace from a multi-step agent might include a PDF extraction step, three tool calls to external APIs, and a final response generation, and a well-designed eval system scores each step, not just the output.
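To make per-step scoring concrete, here is a minimal sketch of a trace structure with one grader per step name. The `Step` and `Trace` classes, the step names, and the toy graders are all assumptions for illustration; real tracing formats carry far more detail.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Step:
    name: str      # e.g. "pdf_extraction", "tool_call", "response"
    input: Any
    output: Any


@dataclass
class Trace:
    user_message: str
    steps: list[Step] = field(default_factory=list)


# A step grader returns True when the step looks healthy.
StepGrader = Callable[[Step], bool]


def score_trace(trace: Trace, graders: dict[str, StepGrader]) -> list[tuple[str, bool]]:
    """Score every step that has a grader registered under its name."""
    results = []
    for step in trace.steps:
        grader = graders.get(step.name)
        if grader is not None:
            results.append((step.name, grader(step)))
    return results


def first_failure(results: list[tuple[str, bool]]) -> Optional[str]:
    """Name of the earliest failing step, or None if every step passed."""
    for name, ok in results:
        if not ok:
            return name
    return None


trace = Trace(
    user_message="Quote this vendor PDF",
    steps=[
        Step("pdf_extraction", "vendor.pdf", {}),              # extraction returned nothing
        Step("tool_call", {"op": "price"}, {"status": 200}),   # the call itself succeeded
        Step("response", None, "Your quote is $4,200."),
    ],
)
graders = {
    "pdf_extraction": lambda s: bool(s.output),
    "tool_call": lambda s: s.output.get("status") == 200,
    "response": lambda s: bool(s.output),
}
results = score_trace(trace, graders)
print(results, first_failure(results))
```

In this toy trace the final response and the tool call both pass; only the extraction step fails, which is the kind of problem an output-only grader would not see.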

Braintrust's framework research identifies execution path validity, task completion rate, tool selection accuracy, and tool correctness as the key metrics that separate concerns in agent evaluation. Getting clear on these artifacts first makes any evals documentation easier to work through.

Offline vs. Online Evals: Two Different Jobs

Offline evals are CI-style regression testing: you have a dataset, you run your agent against it, you score outputs. Did this prompt change break something that was working? Did the new model version handle edge cases better or worse? Tools like LangSmith and OpenAI Evals are built for this workflow. You define the golden set, run the eval, review the delta.
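The "run the eval, review the delta" loop can be sketched in a few lines. This is a toy harness under stated assumptions: the agents are stand-in lookup tables, the golden set has three invented questions, and the grader is a simple substring check. It is not how LangSmith or OpenAI Evals work internally.

```python
from typing import Callable

Agent = Callable[[str], str]
Grader = Callable[[str, str], bool]  # (agent output, expected text) -> passed


def run_eval(agent: Agent, dataset: list[tuple[str, str]], grader: Grader) -> list[bool]:
    """Run the agent on every golden input and grade each output."""
    return [grader(agent(question), expected) for question, expected in dataset]


def regressions(baseline: list[bool], candidate: list[bool]) -> list[int]:
    """Indices of examples that passed before the change and fail after it."""
    return [i for i, (b, c) in enumerate(zip(baseline, candidate)) if b and not c]


def contains_expected(output: str, expected: str) -> bool:
    return expected in output


# Hypothetical golden set: (question, text the answer must contain).
GOLDEN = [
    ("refund window?", "30 days"),
    ("shipping time?", "5 business days"),
    ("support email?", "help@example.com"),
]

# Stand-ins for the agent before and after a prompt change.
BASELINE_ANSWERS = {
    "refund window?": "Refunds are accepted within 30 days.",
    "shipping time?": "Shipping takes 5 business days.",
    "support email?": "Please call us.",
}
CANDIDATE_ANSWERS = {
    "refund window?": "Refunds are accepted within 90 days.",
    "shipping time?": "Shipping takes 5 business days.",
    "support email?": "Write to help@example.com.",
}

baseline = run_eval(BASELINE_ANSWERS.__getitem__, GOLDEN, contains_expected)
candidate = run_eval(CANDIDATE_ANSWERS.__getitem__, GOLDEN, contains_expected)

print("pass rates:", sum(baseline) / len(baseline), sum(candidate) / len(candidate))
print("regressed examples:", regressions(baseline, candidate))
```

Both versions pass two of three examples here, so the aggregate pass rate does not move, yet one previously correct answer now fails. Reviewing per-example regressions rather than only the headline score is what surfaces that.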

Online evals are fundamentally different. You're scoring live traffic, real user sessions, actual agent runs on inputs you didn't design or anticipate. There's no predetermined ground truth for most of what you're seeing. A user asked something unexpected. The agent took a novel path. You're scoring behavior after the fact, often without knowing in advance what "correct" looks like.

The harder problem with online evals is sampling. Most approaches score 5-10% of production logs to manage cost and latency. That's sufficient for detecting aggregate trend shifts: "our hallucination rate went from 3% to 5% this week." But it structurally under-covers rare failure modes. A hallucination that occurs in 0.3% of sessions lands in a 5% sample for only about 0.015% of all sessions (0.3% × 5%), so at modest traffic volumes it won't reliably appear at all. If that failure mode is confined to a specific user segment, a specific input type, or a specific tool call sequence, you can miss it entirely.
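The sampling arithmetic is easy to check. The sketch below assumes sessions are sampled uniformly and independently and that failures are independent of one another, which real traffic may not satisfy; the traffic volumes are made up for illustration.

```python
def expected_sampled_failures(sessions: int, failure_rate: float, sample_rate: float) -> float:
    """Expected number of failing sessions that end up in the scored sample."""
    return sessions * failure_rate * sample_rate


def prob_at_least_one(sessions: int, failure_rate: float, sample_rate: float) -> float:
    """Chance that the sample contains at least one failing session.

    Each session is both failing and sampled with probability
    failure_rate * sample_rate, independently of the others.
    """
    per_session = failure_rate * sample_rate
    return 1.0 - (1.0 - per_session) ** sessions


FAILURE_RATE = 0.003  # the 0.3% failure mode from the text
SAMPLE_RATE = 0.05    # scoring 5% of logs

for sessions in (1_000, 10_000, 100_000):
    expected = expected_sampled_failures(sessions, FAILURE_RATE, SAMPLE_RATE)
    chance = prob_at_least_one(sessions, FAILURE_RATE, SAMPLE_RATE)
    print(f"{sessions:>7} sessions: expect {expected:.2f} sampled failures, "
          f"P(at least one) = {chance:.3f}")
```

Under these assumptions, a thousand sessions yields an expected 0.15 sampled failures, so most weeks the sample contains none. Higher volume helps, but a handful of instances is still thin evidence when the failure is concentrated in one segment.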

As we've found working across production agents at Sentrial: if you sample or only track a slice, you miss the issues that show up in production, and those are exactly what teams care about once the agent is actually used. The decision framework is straightforward. Sampling works for trends. Full coverage is required for catching silent regressions on rare but critical failure modes.

Online evaluation techniques research confirms this pattern: high offline scores don't always guarantee success in production, and online evaluation measures performance using real user traffic without relying on predefined ground truth.

The LLM-as-Judge Problem Nobody Warns You About

The obvious critique of LLM-as-judge is cost and latency. That's real but manageable. The non-obvious problem is that LLM judges are brittle on exactly the failure modes that matter most in production.

A generic LLM judge evaluates surface quality: is this response fluent? Is it coherent? Does it sound confident? But confident and fluent is precisely what a hallucination looks like. The judge has no knowledge of what's actually true in your domain, what your specific agent's tool calls returned, or what your users' expectations are. It grades plausibility, not correctness. A response that sounds like a correct answer to a refund question gets rated highly even if the refund was never processed.

Research on LLM judge reliability found that the reliability of prompted judges is inhibited by factors including prompt design, formatting, and in-context example selection, and that underlying LLMs suffer from sensitivity, brittleness, and hallucinations, as do prompted LLM judges built on top of them.

Teams work around this in a few ways. Deterministic invariants handle the things you can check structurally: does the response contain a required field? Does it avoid a prohibited phrase? Does the stated action match the actual tool-call result? Golden-set comparisons catch regression on cases you've already labeled. Human-in-the-loop review handles borderline cases where automated graders disagree.
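The workarounds above can be combined in a small routing layer. In this sketch the session dictionary shape, the `order_id` field, the prohibited phrase, and the grader names are all hypothetical; the point is only that structural checks run first and disagreement between graders sends the case to a person.

```python
from typing import Callable

# An invariant returns True when it holds for the session.
Invariant = Callable[[dict], bool]


def has_required_field(session: dict) -> bool:
    return "order_id" in session["response_fields"]


def avoids_prohibited_phrase(session: dict) -> bool:
    return "guaranteed" not in session["response_text"].lower()


def action_matches_tool_result(session: dict) -> bool:
    # What the agent said happened must agree with what the tool reported.
    tool_succeeded = session["tool_result"]["status"] == "ok"
    return session["claimed_success"] == tool_succeeded


INVARIANTS: dict[str, Invariant] = {
    "required_field": has_required_field,
    "prohibited_phrase": avoids_prohibited_phrase,
    "action_matches_tool": action_matches_tool_result,
}


def violated(session: dict) -> list[str]:
    """Names of the invariants this session breaks."""
    return [name for name, check in INVARIANTS.items() if not check(session)]


def needs_human_review(verdicts: dict[str, bool]) -> bool:
    """Route to a reviewer when automated graders disagree with each other."""
    return len(set(verdicts.values())) > 1


session = {
    "response_text": "Your refund is guaranteed and has been processed.",
    "response_fields": {"order_id": "ORD-123456"},
    "claimed_success": True,
    "tool_result": {"status": "error"},
}
print(violated(session))
print(needs_human_review({"rule_based": False, "llm_judge": True}))
```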

The root issue, though, is that a general-purpose judge doesn't know what "wrong" means for your specific system. That's why at Sentrial we use post-trained models fine-tuned on each customer's actual agent traffic rather than a generic eval harness. A classifier trained on your logs learns what failure looks like in your context, not a definition of quality abstracted from thousands of other systems. Teams can also instantiate custom classifiers in minutes to track whatever failure mode they care about, whether that's hallucinated GL codes, incomplete tool-call chains, or jailbreak attempts, without building a full labeling pipeline first. Teams check 3-4 example logs, deploy a fine-tuned classifier, and get full-log coverage on that failure mode immediately.

One example of this in practice: a fintech company we work with, Sailfin Tech, instantiated a custom classifier on their production logs targeting mismatched GL codes. This was a highly specific failure mode that only became non-deterministic once they introduced agents into their workflow. When they had their first working classifier running, it didn't just surface overall issues with their agent in production; it also revealed gaps in what they had originally tested with evals back when their system was still deterministic. The classifier gave them visibility into a failure category that their earlier eval setup had no mechanism to catch, because the failure itself didn't exist in that form until the agent was live.

What Evals Look Like in Practice

The clearest example of where evals should exist but often don't: a Series B finance startup that built an agent to handle vendor accounts receivable end-to-end.

The agent's job was to take a vendor PDF, extract the relevant data, run several tool calls to compute a quote, and send it to the customer. It launched in week two. Outputs looked correct: different PDFs produced different quotes, and the prices seemed approximately right. No errors, no alerts, no crashes.

The agent was hallucinating the quotes. The PDF ingestion step was broken. Instead of extracting data from the actual document, the agent was using surrounding context (the RFP text, customer history, and other signals) to construct a price estimate using LLM reasoning. The tool calls succeeded. The response was confident. The output was wrong on every quote.

An offline eval on this system would have needed to check not just the final output but whether the quote price was derived from the extracted PDF data specifically. That's a grader that compares intermediate tool-call results, not just the final response. An online eval running against live traffic would need to classify whether PDF extraction was functioning correctly across every session, not a 5% sample.
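A grader of the kind described here compares the final quote against what the extraction step actually produced. The sketch below is illustrative only: the line-item fields and the pricing rule (total equals the sum of unit price times quantity) are assumptions, since the startup's real quote logic is not described.

```python
from decimal import Decimal


def quote_grounded_in_extraction(
    extracted_line_items: list[dict],
    quoted_total: str,
    tolerance: Decimal = Decimal("0.01"),
) -> str:
    """Label whether the quoted total can be recomputed from the extracted PDF data.

    Assumed pricing rule (hypothetical): total = sum(unit_price * quantity).
    """
    if not extracted_line_items:
        # Nothing was extracted, so any quote must have come from somewhere else.
        return "extraction_empty"
    expected = sum(
        (Decimal(item["unit_price"]) * item["quantity"] for item in extracted_line_items),
        Decimal("0"),
    )
    if abs(expected - Decimal(quoted_total)) > tolerance:
        return "quote_not_derived_from_pdf"
    return "ok"


items = [
    {"unit_price": "12.50", "quantity": 4},
    {"unit_price": "3.00", "quantity": 10},
]
print(quote_grounded_in_extraction(items, "80.00"))
print(quote_grounded_in_extraction(items, "95.00"))
print(quote_grounded_in_extraction([], "95.00"))
```

The empty-extraction case gets its own label because it is the scenario in this story: the downstream steps still ran and still produced a plausible number.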

When a session like this gets flagged, the most useful next step isn't a score; it's being able to replay that session from an intermediate step to isolate whether the failure was in the extraction step, the reasoning step, or the response generation. Without that, you know something broke but not where.

The finance startup said the issue would not have been caught "for a century" with their existing monitoring. They were right. Nothing in their stack was looking at intermediate agent behavior, only final outputs and error codes.

This pattern is common. Silent failure research shows that in agentic workflows, hallucination cascades compound: an inventory agent that invents a non-existent SKU in step one will call downstream APIs to price, stock, and ship it. Every API call returns HTTP 200. The entire workflow is a failure. Evals that only check the final response miss everything upstream.

How to Get Started with Evals

The sequence that works regardless of tooling:

Step 1: Pick one failure mode you can describe precisely. Not "the agent is bad sometimes." Something specific: "the agent states that a refund was processed when the tool call returned an error." You can build a grader for that. You can't build a grader for vague quality concerns.

Step 2: Collect 50-100 real examples from production logs and label them manually. This is the work most teams skip. Spending two days labeling real sessions produces a dataset that reflects actual failure patterns, not ones you imagined in advance. Your golden dataset should come from production, not from synthetic inputs.

Step 3: Build a grader. Start deterministic if you can. Does the response contain the expected field? Does the stated action match the tool-call result? Use LLM-as-judge only for subjective criteria where you genuinely can't write a rule, and be aware of the brittleness described above.

Step 4: Run your first offline eval before the next release. You now have a regression test. If the next model update or prompt change breaks this behavior, you'll know before it ships.
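The four steps can be sketched end to end for the refund example from Step 1. Everything specific here is an assumption: the regex, the `tool_status` values, and the four hand-labeled rows stand in for the 50-100 real sessions you would label.

```python
import re

# Step 1: the failure mode, stated precisely. The response claims a refund
# went through while the tool call returned an error.
REFUND_CLAIM = re.compile(r"refund (has been|was|is) (processed|issued|completed)", re.IGNORECASE)


# Step 3: a deterministic grader for that failure mode.
def false_refund_claim(response: str, tool_status: str) -> bool:
    """True when the failure mode is present in this session."""
    return bool(REFUND_CLAIM.search(response)) and tool_status == "error"


# Step 2: hand-labeled sessions (hypothetical). human_label is True when a
# reviewer judged the session to show the failure mode.
LABELED = [
    {"response": "Your refund was processed.", "tool_status": "error", "human_label": True},
    {"response": "Your refund was processed.", "tool_status": "ok", "human_label": False},
    {"response": "I could not complete the refund.", "tool_status": "error", "human_label": False},
    {"response": "Good news, the money is on its way back to you!", "tool_status": "error", "human_label": True},
]


# Step 4: run the grader over the labeled set and compare with the humans.
def precision_recall(examples: list[dict]) -> tuple[float, float]:
    tp = fp = fn = 0
    for ex in examples:
        predicted = false_refund_claim(ex["response"], ex["tool_status"])
        if predicted and ex["human_label"]:
            tp += 1
        elif predicted and not ex["human_label"]:
            fp += 1
        elif not predicted and ex["human_label"]:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


precision, recall = precision_recall(LABELED)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy set the rule never flags a good session, but it misses the paraphrased claim in the last row. That gap between the grader and the human labels is the signal for when a rule is enough and when a model-based grader is worth its brittleness.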

For tooling: OpenAI Evals and LangSmith cover offline/framework evals well. DeepEval, RAGAS, and TruLens handle specialized RAG and agent scoring. For tool selection between observability platforms, our LLM observability platform comparison and Arize vs Braintrust breakdown go deeper on the tradeoffs.

For teams that want tracing, evaluations, A/B testing, alerting, and code-level debugging in one platform, Sentrial covers all of it. Session-level tracing captures inputs, outputs, latency, and token costs at every step. Automated evaluations flag hallucinations, tool failures, user frustration, and goal abandonment across every interaction, not a sampled slice. Prompt A/B testing with statistical rigor runs in production. Real-time Slack alerts fire on error spikes and behavioral anomalies, with source-code-level failure pinpointing and fix suggestions so engineers know exactly where to look. And when a session needs deeper investigation, replay and fork from any intermediate step in the agent run is built in. Sentrial integrates via OpenTelemetry, LangChain, LangGraph, or custom Python agents.

The practical heuristic: start offline, get comfortable with your graders, then extend to online. Skipping offline evals because you have production monitoring is like skipping unit tests because you have error logging. They're different jobs. Both matter. The teams that get this right in 2026 build both layers, and they take the time to make the graders actually match their failure modes, not generic definitions of quality.

FAQ

What are evals in AI? Evals are repeatable scoring processes that measure AI model or agent outputs against defined criteria. They use test datasets, graders, and expected outcomes to detect regressions and quality issues. In agent contexts, evals can run offline against curated datasets before deployment, or online against live production traffic.

What is the meaning of evals? In AI development, "evals" is shorthand for evaluations: structured tests that assess whether a model or agent is producing correct, useful, or safe outputs. The term covers everything from simple pass/fail checks to LLM-based scoring systems that classify outputs across multiple quality dimensions.

How do offline and online evals differ in production monitoring? Offline evals run before deployment against curated datasets with known expected outputs; they're CI-style regression tests. Online evals run against live production traffic where there's often no predetermined ground truth. Offline evals catch regressions on known failure modes; online evals catch new failure modes that emerge from real user behavior. The two serve different functions, and most mature teams run both.

How do I actually use evals? Start by picking one specific failure mode you can describe precisely. Collect 50-100 real examples from production logs and label them manually. Build a grader, starting with deterministic checks before reaching for LLM-as-judge. Run your first offline eval before the next release. Once that workflow is stable, extend to online evaluation of live traffic for continuous coverage.

Why do evals sometimes miss the most serious failures? Evals are designed to score outputs against defined criteria, which means they can only catch failures you anticipated in advance. Production agents fail in ways that have no ground truth: a response that's fluent and confident but factually wrong, a multi-step workflow where every tool call succeeds but the underlying data was corrupted in step one, a failure mode that occurs in 0.3% of sessions and never appears in a 5% sample. These gaps between "evals passed" and "agent is wrong" are where silent failures live. A full-stack production monitoring platform that combines session-level tracing, automated evaluations with full-log coverage, real-time alerting, and code-level debugging is what closes that gap in practice.

Try Sentrial

Catch behavioral AI agent failures traditional APM tools miss.

Get started
