Evals Explained: Offline, Online, and What Both Miss

Most teams that ship AI agents have passing evals and production failures at the same time. That's not a tooling problem. It's a misunderstanding of what evals are actually measuring and when they fall short. We've spent a lot of time in production logs watching confident, fluent agent responses turn out to be completely wrong, with zero alerts fired, zero errors thrown, and zero indication anything had gone sideways.

Evals Are Simpler Than Teams Treat Them

Evals are repeatable scoring processes that measure AI model or agent outputs against defined criteria. Three things make up every eval system: a set of inputs, a grader that scores outputs against expected outcomes or quality criteria, and a way to track results over time so you catch regressions. Every eval system, from OpenAI Evals to custom production classifiers, is a variation of that same structure.

The term gets used two different ways and the distinction matters immediately. "Offline evals" means running your agent against a curated dataset before shipping a change, scoring the outputs, and confirming nothing regressed. "Online evals" or "production evals" means running classifiers against live traffic after your agent is already deployed, scoring real user sessions as they happen. These aren't interchangeable. They answer different questions, run at different times, and fail in different ways.

Offline evals give you fast, controlled regression tests. Online evals give you ground truth about what's actually happening with real users. Anthropic's engineering team describes effective evaluation as combining automated evals for fast iteration, production monitoring for ground truth, and periodic human review for calibration. All three layers are necessary because each one catches failures the others miss.

Most Agent Failures Don't Look Like Failures

Traditional monitoring catches crashes, timeouts, and latency spikes. It works because the failure is explicit: an error is thrown, a status code is wrong, a service goes down. AI agents fail differently, and that's what makes them hard.

Across 12 million logs we've analyzed at Sentrial, roughly 22% of agent issues were explicit tool call failures where something stopped the run cleanly. The other 78% were silent failures: hallucinations at the top, user frustration second, agent forgetfulness or laziness third. In the vast majority of agent failures, no error fires. The agent returns a complete, fluent response. The user just gets a wrong or useless answer and leaves.

Beam.ai's analysis of enterprise AI risk found that only 5% of AI agents reaching production have mature monitoring, and 47% of organizations using generative AI have experienced hallucinated outputs. That 5% number is bad. Traditional APM tracks uptime. AI monitoring has to simultaneously track relevance, coherence, factual accuracy, and user satisfaction, and none of those appear in any error log.

This is exactly what evals are designed to catch. A grader that scores whether an agent's stated action matched the actual tool-call result, or whether the agent's answer contradicts information it retrieved two steps earlier, will surface failures that generate zero alerts in any conventional monitoring stack. Our article on LLM observability and silent failures covers the full picture, and we have a dedicated piece on AI agent regression testing that goes deeper on that workflow.

Three Pieces That Every Eval System Is Built From

Regardless of tooling, every eval system comes down to three artifacts.

The first is a dataset: a collection of inputs and, optionally, expected outputs. Offline eval datasets are often curated manually, formatted as JSONL files where each record contains an agent input and a ground-truth label or reference output. These are sometimes called golden sets. The quality of your golden set sets the ceiling for what your offline evals can tell you.
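As a minimal sketch of what a golden set looks like on disk, the snippet below parses a two-record JSONL string. The field names (`input`, `expected`) and the sample records are assumptions made for illustration, not a format required by any particular eval tool.

```python
import io
import json

# Hypothetical golden-set records; the field names are illustrative assumptions.
GOLDEN_JSONL = """\
{"input": "What is the refund window?", "expected": "30 days"}
{"input": "Which plan includes SSO?", "expected": "Enterprise"}
"""

def load_golden_set(stream):
    """Parse one JSON record per non-empty line and require an 'input' field."""
    records = []
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "input" not in record:
            raise ValueError(f"line {line_no}: missing 'input'")
        records.append(record)
    return records

# In practice the stream would be an open file; StringIO keeps the sketch self-contained.
golden_set = load_golden_set(io.StringIO(GOLDEN_JSONL))
```

The `expected` field is left optional on purpose, since some records may only be scored against quality criteria rather than a reference output.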

The second is a grader. A grader takes an agent output and returns a score or label. The simplest graders are deterministic: does the response contain a required field? Does it avoid a prohibited string? Did the tool call include all required parameters? More complex graders use LLM-as-judge, passing the output to a second model for scoring, or human labels for cases that require judgment. The grader defines what "good" and "bad" mean for your specific evaluation.
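The three deterministic questions above can each be sketched as a small function. The `GradeResult` type and the function names here are hypothetical, chosen only to show the shape of a grader: it takes an output and returns a label plus a reason.

```python
from dataclasses import dataclass

@dataclass
class GradeResult:
    passed: bool
    reason: str

def grade_required_fields(output: dict, required: list[str]) -> GradeResult:
    """Fail if any required field is absent from a structured response."""
    missing = [name for name in required if name not in output]
    if missing:
        return GradeResult(False, f"missing fields: {missing}")
    return GradeResult(True, "all required fields present")

def grade_prohibited_strings(text: str, prohibited: list[str]) -> GradeResult:
    """Fail if the response contains any prohibited string (case-insensitive)."""
    lowered = text.lower()
    hits = [p for p in prohibited if p.lower() in lowered]
    if hits:
        return GradeResult(False, f"prohibited strings found: {hits}")
    return GradeResult(True, "no prohibited strings")

def grade_tool_params(call_args: dict, required_params: list[str]) -> GradeResult:
    """Fail if a tool call omitted a required parameter or passed it as None."""
    missing = [p for p in required_params if call_args.get(p) is None]
    if missing:
        return GradeResult(False, f"missing tool parameters: {missing}")
    return GradeResult(True, "all required parameters supplied")
```

For example, `grade_required_fields({"quote": 120}, ["quote", "currency"])` fails with a reason naming `currency`, which is the kind of explanation that makes a failing score actionable.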

The third is traces. A trace is the full record of an agent run: the user input, every intermediate step, every tool call and its result, and the final response. Traces are what make production evals possible. When you run classifiers against live traffic, you're classifying traces, not isolated outputs. Key metrics that matter in agent evaluation include task completion rate, plan adherence, tool selection accuracy, and tool correctness. Each of those requires trace-level data to score reliably, because a single final output doesn't tell you where in the pipeline the failure occurred.
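To make the trace-level point concrete, here is a hedged sketch of a trace record and two metrics computed from it. The `Trace` and `ToolCall` shapes are assumptions for this example, and the metric definitions are one simple reading of "tool correctness" and "task completion rate"; real frameworks may define them differently.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolCall:
    name: str
    ok: bool  # True if the tool returned without an error

@dataclass
class Trace:
    user_input: str
    final_response: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    task_completed: bool = False  # assumed to come from a label or a grader

def tool_correctness(trace: Trace) -> float:
    """Share of tool calls that returned without error (1.0 if none were made)."""
    if not trace.tool_calls:
        return 1.0
    return sum(call.ok for call in trace.tool_calls) / len(trace.tool_calls)

def first_failing_step(trace: Trace) -> Optional[int]:
    """Index of the first failed tool call, or None if every call succeeded."""
    for index, call in enumerate(trace.tool_calls):
        if not call.ok:
            return index
    return None

def task_completion_rate(traces: list[Trace]) -> float:
    """Share of traces marked as completed (0.0 for an empty list)."""
    if not traces:
        return 0.0
    return sum(t.task_completed for t in traces) / len(traces)

# Two hypothetical traces: the first fails at its second tool call.
traces = [
    Trace("q1", "answer", [ToolCall("search", True), ToolCall("calc", False)], False),
    Trace("q2", "answer", [ToolCall("search", True)], True),
]
```

`first_failing_step` is the part a final-output grader cannot provide: it locates the failure inside the run instead of only reporting that the run went wrong.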

At Sentrial, we use OpenTelemetry for initial log capture, then run classification and clustering on top of that trace data. Capturing traces is table stakes. What you do with them is what determines whether you actually find the failures that matter.

Offline and Online Evals Are Solving Different Problems

Offline evals run in CI before a change ships. You have a dataset, you run the agent against it, you score outputs, and you check whether this change broke anything that was previously working. Tools like LangSmith and OpenAI Evals are purpose-built for this workflow. Fast feedback, known ground truth, reproducible results.

Online evals run against live production traffic. There's no curated test set, no known ground truth, and the volume is orders of magnitude higher. You're scoring real user sessions as they happen, trying to catch failure modes you may not have thought to write tests for. Online evaluation measures system performance using real user traffic without relying on predefined ground truth; offline evaluation works with static datasets where high scores don't always guarantee production success.

Here's the structural problem most teams discover too late: sampling. The standard approach to managing cost and latency in online evals is to score a representative sample of traffic, typically 5 to 10 percent of logs. Sampling works fine for detecting broad trends. If your hallucination rate is 8%, a 5% sample will find it.

But silent regressions in production don't always manifest at 8%. A hallucination that happens 0.3% of the time is still thousands of failures per month at scale, and it won't appear reliably in a 5% sample. We think you need to monitor every single log, not a percentage of them. If you sample, you miss the issues that show up in production, and those are exactly what teams care about once the agent is actually being used.
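The arithmetic behind that claim can be worked through in a few lines. The monthly volume of one million sessions and the 1,000-session window are hypothetical figures chosen for illustration, and the probability calculation assumes failures and sampling decisions are independent.

```python
def expected_failures(sessions: int, failure_rate: float) -> float:
    """Expected number of failing sessions at a given failure rate."""
    return sessions * failure_rate

def prob_sample_sees_failure(sessions: int, failure_rate: float,
                             sample_fraction: float) -> float:
    """P(at least one failing session lands in the sample), assuming independence."""
    return 1 - (1 - failure_rate * sample_fraction) ** sessions

MONTHLY_SESSIONS = 1_000_000  # hypothetical volume
FAILURE_RATE = 0.003          # the 0.3% rate from the text
SAMPLE_FRACTION = 0.05        # the 5% sample from the text

monthly_failures = expected_failures(MONTHLY_SESSIONS, FAILURE_RATE)
sampled_failures = monthly_failures * SAMPLE_FRACTION
unscored_failures = monthly_failures - sampled_failures

# Chance the sample catches anything within a small window or customer segment.
small_window = prob_sample_sees_failure(1_000, FAILURE_RATE, SAMPLE_FRACTION)
```

Under these assumed numbers, roughly 3,000 sessions fail per month, about 150 of them land in the sample, and the remaining failing sessions are never scored. Within a window of 1,000 sessions, the chance that the sample contains even one failing session is about 14%.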

The decision framework is straightforward: sampling is sufficient for aggregate trend detection. Full coverage is required for catching silent regressions on rare but critical failure modes. Most production agents eventually need both.

LLM-as-Judge Has a Deeper Problem Than Cost

The common warning about LLM-as-judge is that it's expensive and slow. That's real, but it's not the most important problem.

The deeper problem is that a generic LLM judge is brittle on exactly the failure modes that matter most in production. Confident-sounding wrong answers, subtle hallucinations, and context-drop failures are the cases where a general-purpose judge tends to score outputs as acceptable, because the outputs are fluent, grammatically correct, and plausible-sounding. The judge has no knowledge of what's actually true in your domain, what your agent is supposed to be capable of, or what your users expect. It grades surface quality, not behavioral correctness for your specific system.

Research on prompted LLM judges confirms this directly: the reliability of prompted judges is limited by prompt design, formatting, and in-context example selection, and the underlying models suffer from sensitivity, brittleness, and hallucinations. A judge that hallucinates cannot reliably catch hallucinations. We keep seeing teams build elaborate LLM-as-judge setups and then wonder why they're still getting blindsided by production failures.

There are mitigations worth implementing. Deterministic invariants should be your first line: does the response contain a required field? Does it avoid a prohibited phrase? Did all required tool calls execute before the final response was generated? These checks are cheap, fast, and can't be fooled by fluent writing. For subjective criteria where deterministic checks fall short, human-in-the-loop disagreement workflows that route borderline cases to a domain expert improve calibration over time.
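The ordering invariant ("did all required tool calls execute before the final response?") is a trace-level check and can be sketched as below. The event format (dicts with `type`, `name`, and `ok` keys) and the tool names are assumptions for this example, not a standard trace schema.

```python
def required_tools_ran_before_response(events: list[dict],
                                       required_tools: list[str]) -> bool:
    """Return True only if every required tool succeeded before the final response.

    `events` is an ordered list of dicts with a 'type' of either
    'tool_call' (plus 'name' and 'ok') or 'final_response'.
    """
    completed = set()
    for event in events:
        if event["type"] == "tool_call" and event.get("ok"):
            completed.add(event["name"])
        elif event["type"] == "final_response":
            # Only calls that finished before this point count.
            return set(required_tools) <= completed
    return False  # no final response was recorded

# Hypothetical runs: one in the right order, one that answers too early.
good_run = [
    {"type": "tool_call", "name": "lookup_invoice", "ok": True},
    {"type": "tool_call", "name": "compute_quote", "ok": True},
    {"type": "final_response"},
]
early_answer = [
    {"type": "tool_call", "name": "lookup_invoice", "ok": True},
    {"type": "final_response"},
    {"type": "tool_call", "name": "compute_quote", "ok": True},
]
```

A check like this passes or fails regardless of how fluent the final response reads, which is the property that makes it a useful first line before any judge model is involved.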

The real fix is moving away from generic judges toward classifiers trained on your agent's actual traffic. A model fine-tuned on your specific logs learns what "wrong" looks like for your system, not a generic definition of quality. This is why we use post-trained models at Sentrial rather than a standard LLM-as-judge setup. We cluster millions of logs using models post-trained on each customer's data, fine-tuned to the patterns of that specific agent's traffic. The result is meaningfully higher accuracy at lower cost than passing everything through a general-purpose judge.

Beyond built-in classifiers for hallucinations, bad tool calls, agent forgetfulness, and jailbreaking, teams can instantiate whatever classifiers they need. Check three or four example logs and a fine-tuned classifier deploys in under a minute. A finance company we worked with built a mismatched GL codes classifier this way because their end-state checks had too many variations to handle with simple logic. Agents can reach the same outcome through a hundred different paths, and static checks can't track that kind of behavioral variation.

A Real Failure That Evals Should Have Caught

The most useful way to understand evals is to watch them fail to catch something that should have been caught.

A Series B finance startup we worked with built an agent to automate accounts receivable end-to-end. The flow was: ingest a vendor PDF, extract line items, run tool calls to compute a quote, return the quote to the customer. The agent launched in week two. It looked correct. Different PDFs produced different quotes and the prices seemed approximately right.

What was actually happening: the PDF ingestion step was broken. The agent wasn't extracting the document data. Instead, it was hallucinating the quote price based on other context available to it, including the RFP and supplementary customer data, not the actual PDF. The outputs were plausible. No error fired. From a surface perspective, the agent succeeded end-to-end.

This is what an eval should have caught, and what standard monitoring had no chance of catching. An offline eval grading "does the agent produce a quote?" would have passed. An online eval grading "do the agent's extracted line items match the source PDF?" would have flagged it immediately. The grader's question here isn't "did the agent return something?" but "did the specific intermediate artifact, the extracted PDF data, actually match the source document?"

Your grader has to score the right thing. In agentic workflows, hallucination cascades are dangerous because each step can technically succeed (HTTP 200, tool call completed) while the entire workflow produces a wrong result. Catching these requires classifiers that score intermediate artifacts, not just final outputs.
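As a rough illustration of grading an intermediate artifact, the sketch below checks that each extracted line item is grounded in the source document's text. It assumes the PDF text is available as a plain string from a separate extraction step, and it uses coarse substring matching, so it is a starting point for this failure mode and not a robust document comparison. The function name and record fields are hypothetical.

```python
from decimal import Decimal

def grade_line_items_grounded(line_items: list[dict], source_text: str):
    """Return (passed, problems) for extracted line items versus source text."""
    if not line_items:
        # Broken ingestion often shows up as an empty extraction.
        return False, ["no line items extracted"]
    normalized = " ".join(source_text.lower().split())
    digits_only = normalized.replace(",", "")  # so 1,200.00 matches 1200.00
    problems = []
    for item in line_items:
        description = item["description"].lower()
        amount = f"{Decimal(item['amount']):.2f}"
        if description not in normalized:
            problems.append(f"description not in source: {item['description']}")
        if amount not in digits_only:
            problems.append(f"amount not in source: {amount}")
    return (not problems), problems

# Hypothetical source text and two candidate extractions.
SOURCE_TEXT = "Invoice 42\nWidget A   1,200.00\nShipping  50.00"
grounded_items = [
    {"description": "Widget A", "amount": "1200.00"},
    {"description": "Shipping", "amount": "50"},
]
hallucinated_items = [{"description": "Consulting", "amount": "900.00"}]
```

In the scenario above, the final quote looked plausible, but a check at this step would have failed either on an empty extraction or on items that never appear in the document.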

This is also where replay becomes essential. When an eval flags a failure in a live session, a score alone isn't enough to fix anything. Being able to replay that session from any intermediate step, rerunning from the retrieval step, the reasoning step, or the response generation step in isolation, is what shortens the distance between "eval caught something" and "we know what to fix." Our piece on AI agent tracing covers how replay and step-level tracing work together in practice.

Here's the Sequence That Actually Works

The four steps that work regardless of tooling are the same every time.

First, pick one failure mode you can describe precisely. Not "the agent gives bad answers," but "the agent states that a tool call succeeded when the tool returned an error." Specificity is what makes a grader buildable.

Second, collect 50 to 100 real examples from production logs and label them manually. This isn't glamorous work, but it's the only way to build a grader that reflects what "wrong" actually looks like in your system rather than in a generic benchmark.

Third, build a grader. Start deterministic if you can. A regex check or a structured field validation will outperform LLM-as-judge for any failure mode that can be expressed as a rule. Use LLM-as-judge only for genuinely subjective criteria where deterministic checks don't reach. As research on iterative agent evaluation confirms, routing automated failures to domain experts for correctness review is how you keep graders calibrated over time.

Fourth, run your first offline eval against this set before the next release. You now have a regression baseline.
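The four steps can be sketched end to end for the failure mode named in step one. Everything here is illustrative: the three labeled examples are invented, the regex is a deliberately simple stand-in for a real rule, and the `tolerance` parameter is an assumption about how a team might define a regression.

```python
import re

# Step 3: a deterministic rule for "claims success although a tool errored".
SUCCESS_CLAIM = re.compile(r"\b(success(ful|fully)?|completed|done)\b", re.IGNORECASE)

def claims_success_after_error(example: dict) -> bool:
    tool_failed = any(not result["ok"] for result in example["tool_results"])
    return tool_failed and bool(SUCCESS_CLAIM.search(example["response"]))

# Step 2: manually labeled examples (hypothetical; real sets would hold 50 to 100).
labeled_examples = [
    {"response": "The refund was processed successfully.",
     "tool_results": [{"ok": False}], "label": "fail"},
    {"response": "I could not process the refund; the payment tool returned an error.",
     "tool_results": [{"ok": False}], "label": "pass"},
    {"response": "Your order was updated successfully.",
     "tool_results": [{"ok": True}], "label": "pass"},
]

def grader_agreement(examples: list[dict]) -> float:
    """Share of examples where the grader's flag matches the manual label."""
    matches = sum(
        claims_success_after_error(ex) == (ex["label"] == "fail") for ex in examples
    )
    return matches / len(examples)

def failure_rate(examples: list[dict]) -> float:
    """Share of examples the grader flags as failures."""
    return sum(claims_success_after_error(ex) for ex in examples) / len(examples)

def regressed(current_rate: float, baseline_rate: float, tolerance: float = 0.0) -> bool:
    """Step 4: compare a new run's failure rate against the stored baseline."""
    return current_rate > baseline_rate + tolerance

agreement = grader_agreement(labeled_examples)
baseline = failure_rate(labeled_examples)
```

Checking `grader_agreement` against the manual labels first matters: a regression baseline is only as trustworthy as the grader that produced it.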

For tooling: OpenAI Evals and LangSmith are the standard starting points for offline, CI-style evaluation. DeepEval, RAGAS, and TruLens cover specialized scoring for RAG pipelines and agent tasks. Our LLM observability platform roundup and Arize vs. Braintrust comparison cover the tradeoffs across these options in more detail.

For production, when offline evals are running and you need full-log coverage rather than sampling, Sentrial handles continuous classification across every interaction. A common and correct sequencing recommendation is to establish offline evaluation first, then test coverage, then metric-to-outcome alignment, and only then add online or continuous evaluation. That sequence matters. Don't skip offline evals because you have production monitoring. They serve different jobs and each catches failures the other misses.

The practical heuristic to close on: your eval system isn't done when it passes. It's done when it would have caught your last three production failures before your users did. Work backward from the failures you've already seen and build graders for those first.

Frequently Asked Questions

What are evals in AI?

In AI, evals (evaluations) are repeatable scoring processes that measure model or agent outputs against defined criteria. They combine a dataset of inputs, a grader that scores outputs, and a mechanism to track results over time. The goal is to catch regressions and failure modes systematically, either before deployment (offline evals) or against live production traffic (online evals).

What is the meaning of evals?

"Evals" is shorthand for "evaluations" in the context of AI and language model development. It refers to structured testing systems that score whether an AI model or agent is performing correctly according to specified criteria. The term is distinct from general-purpose usage of "evaluation" and specifically refers to AI evaluation frameworks, test datasets, graders, and production classifiers used to assess agent behavior.

How do offline and online evals differ, and when should I use each?

Offline evals run before deployment against curated datasets with known expected outputs. They catch regressions fast and work well in CI pipelines. Online evals run against live production traffic with no pre-defined ground truth. They catch failure modes that never appeared in your test set and surface how your agent actually behaves with real users. Use offline evals to confirm nothing regressed before a release. Use online evals to monitor what's happening in production continuously. Most mature setups need both, because each catches failures the other structurally misses.

Why doesn't traditional monitoring catch AI agent failures?

Traditional APM and error monitoring detect crashes, timeouts, and HTTP errors. AI agent failures are mostly silent: a hallucinated answer, a forgotten context, a confident wrong response. No exception is thrown. No alert fires. Across the production logs we've analyzed at Sentrial, 78% of agent failures were in this silent category, surfaced only through structured evaluation, not error logs.

How do I get started with evals if I don't have a dataset yet?

Start by picking one specific failure mode you can describe in a sentence. Pull 50 to 100 real sessions from your production logs and label them manually as pass or fail for that failure mode. Build the simplest possible grader, deterministic checks first, LLM-as-judge only for genuinely subjective criteria. Run your first offline eval before your next release. Once you have a baseline, extend incrementally. Don't try to cover every failure mode at once; a grader that accurately catches one thing is worth more than five graders that catch nothing reliably.
