Buyer's Playbook

AI Agent Testing and Evaluation: The Production Playbook

By Riya Thambiraj · 14 min read

What Matters

  • Only 52% of teams running AI agents have evaluation frameworks, despite 89% having observability - the gap explains why model upgrades feel like rolling dice.
  • Traditional software testing assumes determinism and traditional ML eval assumes fixed input-output pairs. AI agents break both assumptions simultaneously.
  • Three evaluation layers are essential: unit-level (did the agent call the right tool with correct parameters?), trajectory-level (did it take a reasonable path?), and outcome-level (did it achieve the goal?).
  • Anthropic recommends starting eval datasets with 20-50 tasks derived from real production failures, not synthetic benchmarks.
  • A production eval pipeline can be built in two weeks - week one for failure collection and harness setup, week two for LLM-as-judge and CI/CD integration.

Your AI agent has dashboards. Latency charts. Token usage graphs. Error rate monitors. You can tell within seconds if the agent is up or down. You cannot tell if it is right or wrong.

This is not unusual. According to the LangChain State of AI Agents survey of over 1,300 practitioners, 89% of teams running AI agents have observability tooling. Only 52% have evaluation frameworks. That 37-point gap explains a complaint we hear constantly: "After the last model upgrade the agent feels worse, but we can't prove it."

You cannot improve what you cannot measure. And you cannot measure agent quality with uptime dashboards.

TL;DR
AI agent testing requires three evaluation layers: unit-level tool call testing (did it call the right function with the right parameters?), trajectory-level workflow testing (did it take a reasonable path?), and outcome-level grading (did it achieve the goal?). Start with 20-50 test cases from real production failures. Build an LLM-as-judge grading system. Wire it into CI/CD. The entire pipeline takes two weeks to stand up.

Observability vs. Evaluation: The 37-Point Gap

| Dimension | Observability (What Happened) | Evaluation (Was It Correct) | Why It Matters |
| --- | --- | --- | --- |
| Adoption rate | 89% of teams | Only 52% of teams | This 37-point gap is why model upgrades feel like rolling dice |
| What it tells you | Traces, latency, token counts, error rates | Whether the output was right or wrong | You can see every step but can't grade the result |
| When it helps | Is the agent up or down? | Is the agent correct or incorrect? | Your agent can be 'up' and wrong simultaneously |
| Model upgrade impact | Shows performance metrics changed | Shows whether quality improved or degraded | Without evals, you cannot prove the last upgrade made things worse |

Why Standard Testing Breaks with AI Agents

Software testing assumes determinism. You call a function with input X, you expect output Y. If you get output Z, the test fails. Run it again, same result. This is the foundation of unit testing, integration testing, and end-to-end testing.

Traditional ML evaluation assumes fixed input-output pairs. You feed the model a test set, compare predictions to ground truth labels, and compute accuracy. The model produces the same output for the same input (or close to it with temperature 0).

AI agents break both assumptions.

Non-Determinism by Design

An agent asked to "find the best restaurant nearby" might call a search tool, then a reviews API, then a maps API. Or it might call the maps API first, then search, then reviews. Both paths might produce correct results. Both might produce different correct results.

This is non-determinism at the workflow level, not just the output level. The same input can produce different execution paths that lead to different valid outputs. You cannot test this with assertion-based unit tests.

Multi-Step Reasoning Chains

A traditional function call is atomic. An agent workflow is a chain of 5-20 decisions, each contingent on the previous result. Testing the final output tells you if the chain worked. It does not tell you where it broke when it did not work.

If Agent A calls the wrong tool at step 3, the remaining 7 steps may still produce a plausible-looking output. The output is wrong, but it does not look wrong. Traditional end-to-end tests might pass. Only trajectory-level testing - inspecting the sequence of decisions - catches this class of failure.

Tool Use Side Effects

When an agent calls an API, it may trigger real-world side effects. Sending an email. Creating a database record. Initiating a payment. You cannot rerun these tests freely. You need mocking, sandboxing, or careful idempotency design. And your tests need to verify not just that the agent called the right tool, but that it called it with the right parameters and handled the response correctly.

Benchmark Fragility

Popular evaluation benchmarks have a dirty secret: they are poor predictors of real-world performance. Recent research has shown that web agent benchmarks - the standardized tests used to compare agent systems - are "broken" in the sense that high benchmark scores do not correlate with production success.

The GAIA benchmark, one of the most respected agent evaluations, tells the story. Level 3 tasks (the hardest category) have a top score of just 61%, achieved by Writer's Action Agent. This means the best agent system in the world fails 39% of the time on difficult tasks. For context, 32% of organizations cite quality as their top barrier to putting agents in production.

Stanford HAI's 2025 AI Index Report found AI agent performance on SWE-bench jumped from 4.4% in 2023 to 71.7% in 2024 - a 67-percentage-point leap in one year. That's how quickly "benchmark leader" becomes "table stakes." Your eval suite must track real production tasks, not academic benchmarks, or it goes stale before your next model upgrade.

Benchmarks test what is easy to measure. Production tests what matters to measure. Build your eval suite from production data, not benchmarks. At 1Raft, every eval dataset we build for production agents starts with real failure cases, not academic benchmarks.

The Eval Harness: Infrastructure for Agent Testing

An eval harness is the runtime environment where you systematically test your agent against a dataset of tasks and grade the results. Anthropic has published extensively on this pattern, and it works.

Eval Harness Architecture

The runtime environment for systematically testing your agent against a dataset of tasks.

1. Task dataset (YAML or JSON files). Collection of input-expected output pairs. Each task includes user input, required context, semantic outcome, and metadata (difficulty, category, expected tools).

2. Harness runner (sandboxed environment). Executes the agent against each task, captures the full trace (every LLM call, tool call, decision point). Supports mocked tools for side effects. Enforces production guardrails.

3. Grader (multi-strategy grading). Evaluates whether agent output meets success criteria. Deterministic grading (exact match, regex, schema validation) or LLM-as-judge for semantic evaluation.

4. Results tracker (dashboard + alerts). Stores historical results for regression detection. Compares across model versions and code changes. Feeds CI/CD quality gates.

Components of an Eval Harness

Task dataset. A collection of input-expected output pairs. Each task includes the user input, any required context, the expected outcome (not the exact output text - the semantic outcome), and metadata (difficulty level, category, which tools should be called).

Harness runner. The execution environment. Runs the agent against each task, captures the full trace (every LLM call, tool call, decision point), and records the output. Must support mocked tools for tasks with side effects. Must enforce the same guardrails as production (timeout limits, max iterations, cost caps).

Grader. Evaluates whether the agent's output meets the success criteria. Can be deterministic (exact match, regex, schema validation) or LLM-based (a separate model judges whether the output is correct). More on grading strategies below.

Results tracker. Stores historical results so you can track performance over time, detect regressions, and compare across model versions or code changes.

Building the Task Dataset

Anthropic recommends starting with 20-50 tasks derived from real failures. Not synthetic examples. Not benchmarks. Real production failures where the agent got it wrong.

Go through your support tickets, escalation logs, and user complaints. Find the cases where the agent failed. Write each one as a test case:

  • Input: The exact user query or trigger that caused the failure
  • Context: Any relevant state (user profile, account data, conversation history)
  • Expected outcome: What the agent should have done (described semantically, not as exact text)
  • Why it failed: The root cause (wrong tool, bad parameters, hallucinated data, missed escalation)
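A failure-derived test case can be stored as a small structured record. The following Python sketch shows one plausible shape for such a record saved as JSON; the field names (`input`, `context`, `expected_outcome`, `failure_reason`, `metadata`) are illustrative, not a standard schema - adapt them to whatever your harness expects.

```python
import json
import pathlib

# One failure-derived test case. Field names are illustrative,
# not a standard schema - adapt them to your own harness.
task = {
    "id": "refund-wrong-order-001",
    "input": "Cancel my order and refund me",           # exact user query that failed
    "context": {"user_id": "u_1842", "open_orders": 2},  # relevant state at failure time
    "expected_outcome": (
        "Agent asks which of the two open orders to cancel "
        "before calling any refund tool"
    ),                                                   # semantic outcome, not exact text
    "failure_reason": "refunded the wrong order without disambiguating",
    "metadata": {
        "difficulty": "medium",
        "category": "refunds",
        "expected_tools": ["list_orders", "cancel_order", "issue_refund"],
    },
}

# One file per task keeps the dataset diffable and easy to review in PRs.
pathlib.Path("tasks").mkdir(exist_ok=True)
with open("tasks/refund-wrong-order-001.json", "w") as f:
    json.dump(task, f, indent=2)
```

One file per task also makes it trivial to add a new production failure as a test case: copy the template, fill in the fields, commit.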

These 20-50 failure-derived tasks will catch more real problems than 500 synthetic tasks. Failures cluster around patterns - the same root cause produces multiple user-facing failures. Fix the pattern and you fix a category of problems.

Expand the dataset over time:

  • Week 1-2: 20-50 tasks from real failures
  • Month 1: Expand to 100 tasks (add happy paths, edge cases)
  • Month 3: Reach 200-300 tasks (add adversarial inputs, regression cases)
  • Ongoing: Add every new production failure as a test case

Three Evaluation Layers Every Agent Needs

A single pass/fail grade on agent output is not enough. You need evaluation at three layers, each catching different classes of failure.

Layer 1: Unit-Level Tool Call Testing

The most granular layer. Did the agent call the right tool? Did it pass the correct parameters? Did it handle the tool's response correctly?

This is the one layer where deterministic testing works reliably. Tool calls have defined schemas. You can assert exact parameter values, parameter types, and required fields.

What to test:

  • Tool selection accuracy - given input X, did the agent pick tool Y?
  • Parameter correctness - did it pass the right values in the right types?
  • Response handling - did it extract the relevant data from the tool's response?
  • Error handling - when the tool returned an error, did the agent retry, try an alternative, or escalate?

How to test: Mock the tools. Feed the agent a known input. Capture the tool call. Assert against expected parameters. This is fast, deterministic, and cheap to run. No LLM-as-judge needed.

Coverage target: Every tool the agent has access to should have at least 5 test cases - 3 happy path, 1 error case, 1 edge case.
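The mocked-tool pattern can be sketched in a few lines. Here `run_agent` is a stand-in for your real agent entry point (a real agent would route via an LLM), and `MockTool` simply records calls instead of hitting a live API - both names are hypothetical.

```python
# Minimal sketch of unit-level tool call testing with mocked tools.
class MockTool:
    def __init__(self, name, canned_response):
        self.name = name
        self.canned_response = canned_response
        self.calls = []  # records every kwargs dict the agent passed

    def __call__(self, **kwargs):
        self.calls.append(kwargs)
        return self.canned_response

def run_agent(user_input, tools):
    # Stand-in for the real agent: routes a status question to the
    # order-status tool. Your agent would decide this via an LLM.
    if "order" in user_input and "status" in user_input:
        return tools["get_order_status"](order_id="12345")
    return None

def test_tool_selection_and_parameters():
    get_order_status = MockTool("get_order_status", {"status": "shipped"})
    tools = {"get_order_status": get_order_status}

    result = run_agent("What is the status of order 12345?", tools)

    # Layer 1 assertions: right tool, right parameters, response handled
    assert len(get_order_status.calls) == 1
    assert get_order_status.calls[0] == {"order_id": "12345"}
    assert result["status"] == "shipped"

test_tool_selection_and_parameters()
```

Because nothing here touches an LLM or a live API, tests like this run in milliseconds and can gate every commit.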

Layer 2: Trajectory-Level Workflow Testing

The middle layer. Did the agent take a reasonable path through the workflow? Not just "did it arrive at the right answer" but "did it get there via a sensible sequence of actions?"

AI agent testing diverges most from traditional testing at this layer. Multiple paths can be correct. The agent that calls search-then-filter is as correct as the agent that calls filter-then-search - if both produce the right result. Your evaluation must allow for valid path diversity.

What to test:

  • Step count - did the agent solve the task in a reasonable number of steps? (An agent that takes 15 steps for a 3-step task is working but inefficient)
  • Tool sequence - did it call tools in a logical order? (Calling the refund API before looking up the order is wrong regardless of the final output)
  • Backtracking - when a step failed, did it recover gracefully?
  • Unnecessary actions - did it call tools that were irrelevant to the task?

How to test: Run the agent, capture the full trace, then evaluate the trace against trajectory criteria. Some criteria are deterministic (step count thresholds, prohibited tool sequences). Others require LLM-as-judge ("was this sequence of actions reasonable?").
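The deterministic subset of trajectory criteria can be expressed as plain functions over the captured trace. This sketch assumes the trace is a list of tool-call names in order - adapt it to whatever your harness actually records - and the thresholds and tool names are placeholders.

```python
# Deterministic trajectory checks on a captured trace.
MAX_STEPS = 10
# Pairs (prereq, action) where calling the action before its
# prerequisite is wrong regardless of the final output.
ORDERED_PAIRS = [("lookup_order", "issue_refund")]

def check_trajectory(trace, expected_tools):
    failures = []
    if len(trace) > MAX_STEPS:
        failures.append(f"too many steps: {len(trace)} > {MAX_STEPS}")
    for prereq, action in ORDERED_PAIRS:
        if action in trace:
            if prereq not in trace or trace.index(prereq) > trace.index(action):
                failures.append(f"called {action} before {prereq}")
    irrelevant = [t for t in trace if t not in expected_tools]
    if irrelevant:
        failures.append(f"irrelevant tool calls: {irrelevant}")
    return failures

# A refund issued before the order lookup fails the sequence check
bad = check_trajectory(
    ["issue_refund", "lookup_order"],
    expected_tools={"lookup_order", "issue_refund"},
)
good = check_trajectory(
    ["lookup_order", "issue_refund"],
    expected_tools={"lookup_order", "issue_refund"},
)
# bad == ["called issue_refund before lookup_order"]; good == []
```

Checks that cannot be expressed this way ("was this sequence of actions reasonable?") are the ones to hand to an LLM judge.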

Key Insight
An agent can produce the correct final output through a terrible trajectory. It got lucky. Trajectory evals catch fragile success before it becomes production failure.

This is especially critical in multi-agent systems where a bad trajectory in one agent cascades through the entire workflow.

Layer 3: Outcome-Level End-to-End Grading

The top layer. Did the agent achieve the goal? Not "did it produce text that looks correct" but "did the outcome match the success criteria defined for this task?"

This is the hardest layer to get right because outcomes are often semantic, not exact. A correct response to "What is our return policy?" can be worded a thousand ways. You cannot use string matching. You need semantic evaluation.

What to test:

  • Correctness - is the information in the output factually accurate?
  • Completeness - did the output address all parts of the question or task?
  • Actionability - if the agent was supposed to take an action (create a ticket, send an email, process a refund), did it complete the action correctly?
  • Harm avoidance - did the output avoid generating harmful, misleading, or off-policy content?

How to test: LLM-as-judge. A separate LLM evaluates the agent's output against the success criteria. The judge receives the original input, the expected outcome description, and the agent's actual output, then grades it on defined rubrics.
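A judge implementation has two mechanical parts you control: the rubric prompt and the verdict parser. The sketch below shows one plausible shape; the rubric wording is illustrative, and the actual model call (not shown) would go through whatever client you use.

```python
# Sketch of the prompt-and-parse half of an LLM-as-judge grader.
# The model call itself is omitted; plug in your own client.
JUDGE_PROMPT = """You are grading an AI agent's output.

User input: {user_input}
Expected outcome: {expected_outcome}
Agent output: {agent_output}

Rubric:
- PASS: the output achieves the expected outcome; facts are correct and complete.
- FAIL: the output is factually wrong, incomplete, or off-policy.

Answer with exactly one word on the first line: PASS or FAIL.
Then give a one-sentence justification."""

def build_judge_prompt(user_input, expected_outcome, agent_output):
    return JUDGE_PROMPT.format(
        user_input=user_input,
        expected_outcome=expected_outcome,
        agent_output=agent_output,
    )

def parse_verdict(judge_response):
    # Constrained first-line format keeps parsing deterministic even
    # though the justification text is free-form.
    first_line = judge_response.strip().splitlines()[0].strip().upper()
    if first_line not in ("PASS", "FAIL"):
        raise ValueError(f"unparseable verdict: {first_line!r}")
    return first_line == "PASS"
```

Constraining the judge to a one-word first line is a deliberate design choice: it keeps grading machine-readable while still capturing a justification you can review when a verdict looks wrong.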

The LLM-as-judge pattern is not perfect. Judges have biases (they tend to prefer longer responses). They can disagree with each other on borderline cases. But they scale where human evaluation does not, and they correlate well with human judgment on clear pass/fail cases. Research published in PubMed Central found GPT-3.5 hallucinated on 39.6% of systematic review tasks and GPT-4 on 28.6% - even top models produce confident-sounding wrong answers. Grading semantics, not just format, is what catches this class of failure.

Cross-model judging

Use a different model for judging than the one powering your agent. If your agent runs on Claude, use GPT-4o as the judge. If your agent runs on GPT-4o, use Claude as the judge. Same-model judging introduces systematic bias.

This cross-model pattern is standard practice at 1Raft across every agent we ship.

The Three Layers at a Glance

Each layer catches a different class of failure.

Layer 1: Unit-Level Tool Call Testing. Did the agent call the right tool with correct parameters? The one layer where deterministic testing works reliably.

  • Did it call get_order_status with order_id=12345?
  • Did it pass the right values in the right types?
  • Did it handle the tool's error response correctly?
  • Coverage target: 5 test cases per tool (3 happy, 1 error, 1 edge)

Layer 2: Trajectory-Level Workflow Testing. Did the agent take a reasonable path? Multiple paths can be correct. Evaluation must allow for valid path diversity.

  • Did it look up the order before attempting a refund?
  • Did it solve the task in a reasonable number of steps?
  • Did it avoid calling irrelevant tools?
  • When a step failed, did it recover gracefully?

Layer 3: Outcome-Level End-to-End Grading. Did the agent achieve the goal? Uses LLM-as-judge for semantic evaluation since correct outputs can be worded many ways.

  • Did the customer get the correct refund amount?
  • Is the information factually accurate and complete?
  • Did it avoid harmful or off-policy content?
  • Use a different model family for judging to avoid bias

Model Swap Regression Testing

"After the last model upgrade the agent feels worse, but we can't prove it."

Every team running AI agents in production has said this. Model providers ship updates continuously. GPT-4o-2024-08 behaves differently from GPT-4o-2025-01. Claude 3.5 Sonnet routes tool calls differently than Claude 4 Sonnet. A model that scored 94% on your eval suite last month might score 88% on the same suite with a new checkpoint.

Without regression testing, model upgrades are a coin flip. Research published in Scientific Reports found 91% of ML models degrade in performance over time - and that's before factoring in model provider updates that change behavior without warning. For production agents, this isn't a theoretical risk. It's a deployment reality.

The Regression Testing Protocol

Step 1: Baseline. Run your full eval suite against the current model. Record scores at all three layers. This is your baseline.

Step 2: Candidate run. Pin the new model version. Run the exact same eval suite. Record scores.

Step 3: Diff analysis. Compare scores layer by layer. If outcome-level accuracy drops by more than 2 percentage points, the upgrade fails. If trajectory efficiency degrades (agents take 30% more steps for the same tasks), the upgrade fails. If tool call accuracy drops, investigate before proceeding.
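The diff-analysis step is simple enough to express directly. This sketch mirrors the thresholds above (a drop of more than 2 outcome points or more than 30% step inflation fails the upgrade); the input dict shape is an assumption to adapt to your results tracker.

```python
# Sketch of the diff-analysis step for a model swap.
def diff_analysis(baseline, candidate):
    """Each argument: {"outcome_acc": float (0-100),
                       "tool_acc": float (0-100),
                       "avg_steps": float}."""
    verdict = {"fail": False, "warnings": []}
    if baseline["outcome_acc"] - candidate["outcome_acc"] > 2.0:
        verdict["fail"] = True
        verdict["warnings"].append("outcome accuracy dropped > 2 points")
    if candidate["avg_steps"] > baseline["avg_steps"] * 1.30:
        verdict["fail"] = True
        verdict["warnings"].append("trajectory efficiency degraded > 30%")
    if candidate["tool_acc"] < baseline["tool_acc"]:
        # Not an automatic fail, but requires investigation before proceeding
        verdict["warnings"].append("tool call accuracy dropped - investigate")
    return verdict

result = diff_analysis(
    {"outcome_acc": 94.0, "tool_acc": 97.0, "avg_steps": 6.0},
    {"outcome_acc": 88.0, "tool_acc": 96.0, "avg_steps": 6.5},
)
# result["fail"] is True: 94 -> 88 is a 6-point outcome drop
```

The numeric verdict tells you whether to stop; the qualitative review in step 4 tells you why.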

Step 4: Qualitative review. For any task that changed from pass to fail (or fail to pass), review the traces manually. Understand why. New model versions sometimes fix one category of failures while introducing another. The net score may be similar, but the failure distribution has shifted.

Step 5: Staged rollout. Do not swap models globally in one deployment. Route 10% of traffic to the new model. Monitor for 48 hours. Expand to 50%, then 100%. This is the same shadow-to-production pattern used for agent deployment, applied to model versions.

Cost-Quality Tradeoffs

Model swaps affect more than quality. They affect cost too.

A smaller, cheaper model that scores 91% on your eval suite might be a better production choice than a larger model that scores 95% but costs 4x more per call. The eval suite gives you the data to make this decision empirically rather than guessing.

"A client asked us to upgrade their agent to the latest model. We ran the eval suite first. Outcome-level accuracy dropped 4 points on billing tasks even though overall score was similar. That 4-point drop would have generated hundreds of wrong invoices per week. The baseline saved them a painful rollback." - Ashit Vora, Captain at 1Raft

At 1Raft, we maintain eval baselines for every production agent. When a client asks "should we upgrade to the latest model?" the answer comes from data, not opinion. We run the eval suite, compare the numbers, and make a recommendation with evidence.

Building an Agent Eval Pipeline in Two Weeks

Most teams delay eval infrastructure because it feels like a large investment. It does not have to be. Here is a two-week plan that goes from zero evals to a production-grade testing pipeline.

Week 1: Foundation

Days 1-2: Collect failures. Go through escalation logs, support tickets, and user feedback from the last 30-60 days. Pull 20-50 cases where the agent got it wrong. Write each as a structured test case (input, context, expected outcome, failure reason). Store as YAML or JSON files.

Days 3-4: Build the harness. A minimal harness needs three things: a runner that executes your agent against a task and captures the trace, a mock layer for tools with side effects, and a results file that stores pass/fail plus traces. This is 200-400 lines of code if you build it from scratch. Platforms like Braintrust, Langfuse, Galileo, or Evidently AI provide pre-built harnesses if you prefer.
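The skeleton of such a runner fits in a page. In this sketch, `run_agent` is a stand-in for your real agent entry point, and a real harness would add tool mocking, timeouts, and cost caps on top; the task and result schemas are illustrative.

```python
# Minimal harness runner sketch: load tasks, run the agent,
# capture a trace, store results for the grader.
import json
import pathlib
import time

def run_agent(task):
    # Stand-in: return (output, trace). Replace with your agent call;
    # the trace should record every tool call in order.
    return "stub output", ["tool_a", "tool_b"]

def run_suite(task_dir, results_path):
    results = []
    for path in sorted(pathlib.Path(task_dir).glob("*.json")):
        task = json.loads(path.read_text())
        start = time.time()
        output, trace = run_agent(task)
        results.append({
            "task_id": task["id"],
            "output": output,
            "trace": trace,                     # fed to trajectory checks
            "latency_s": round(time.time() - start, 3),
            "graded": None,                     # filled in by the grader
        })
    pathlib.Path(results_path).write_text(json.dumps(results, indent=2))
    return results
```

Everything downstream - grading, regression diffs, CI gates - consumes the results file this loop produces.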

Day 5: Baseline. Run all 20-50 tasks through the harness. Grade them manually (human review, not LLM-as-judge - you need a ground truth baseline). Record your current pass rate. This is day zero. Every future improvement is measured against this number.

Week 2: Automation

Days 6-7: LLM-as-judge. Build a grading prompt that evaluates agent outputs against expected outcomes. The prompt should include rubrics: what makes a pass, what makes a fail, what is borderline. Test the judge against your manual grades from day 5. If agreement is below 85%, refine the rubrics until it reaches 90%+.
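Measuring that agreement is a one-function job. The sketch below compares the judge's verdicts against the day-5 human grades; the dict-of-booleans format is an assumption.

```python
# Judge calibration sketch: fraction of tasks where the LLM judge
# agrees with the human grades. Below 0.85, refine the rubrics.
def agreement_rate(human_grades, judge_grades):
    """Both arguments: dict of task_id -> bool (True = pass)."""
    shared = human_grades.keys() & judge_grades.keys()
    if not shared:
        raise ValueError("no overlapping task ids")
    matches = sum(human_grades[t] == judge_grades[t] for t in shared)
    return matches / len(shared)

human = {"t1": True, "t2": False, "t3": True, "t4": True}
judge = {"t1": True, "t2": False, "t3": False, "t4": True}
# agreement_rate(human, judge) == 0.75 -> rubrics need refinement
```

Re-run this check whenever you change the judge prompt or the judge model; agreement can drift just like agent quality does.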

Days 8-9: Regression suite. Add happy-path test cases to cover the scenarios your agent handles well. Total dataset should reach 50-75 tasks. Set up scheduled runs - daily for production agents, on every PR for agents under active development. Platforms like Maxim and Braintrust support scheduled eval runs natively.

Day 10: CI/CD gates. Wire the eval suite into your deployment pipeline. Define a quality gate: if outcome-level accuracy drops below X% (start with 85-90%), the deployment is blocked. If tool call accuracy drops below Y%, the deployment is blocked. These gates prevent regressions from reaching production.
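The gate logic itself is a few lines that your pipeline runs after the eval suite. Threshold values below are placeholders (the text suggests starting outcome accuracy at 85-90%; the tool-call threshold is an assumption), and in CI a non-empty failure list would translate to a non-zero exit code.

```python
# CI/CD quality gate sketch: a non-empty failure list blocks the deploy.
OUTCOME_GATE = 0.88   # block below 88% outcome-level accuracy (start 85-90%)
TOOL_GATE = 0.95      # placeholder tool call accuracy threshold

def gate(outcome_acc, tool_acc):
    failures = []
    if outcome_acc < OUTCOME_GATE:
        failures.append(f"outcome accuracy {outcome_acc:.0%} < {OUTCOME_GATE:.0%}")
    if tool_acc < TOOL_GATE:
        failures.append(f"tool call accuracy {tool_acc:.0%} < {TOOL_GATE:.0%}")
    return failures

failures = gate(outcome_acc=0.91, tool_acc=0.96)  # scores from the eval run
# failures == [] -> deploy proceeds; in CI: sys.exit(1 if failures else 0)
```

Keep the thresholds in version control next to the eval suite, so raising the bar is a reviewed change rather than a dashboard setting someone quietly edits.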

What This Gives You

After two weeks, you have:

  • A dataset of 50-75 tasks grounded in real production failures
  • An automated harness that runs your agent against the dataset and captures traces
  • LLM-as-judge grading that correlates 90%+ with human judgment
  • CI/CD gates that block regressions before they reach production
  • A baseline score you can track over time

This is not a complete evaluation system. But it is a working foundation that catches the failures that matter most - the ones your users already reported. Expand from here by adding new failure cases as they surface and increasing coverage across the three evaluation layers.

Common Eval Pipeline Mistakes

Testing with synthetic data only. Synthetic tasks test what you imagine going wrong. Real failures test what actually goes wrong. The overlap is smaller than you think. Always start with real failures.

Grading outputs, not trajectories. An agent that produces the right answer through a fragile path will fail unpredictably. Grade the path, not just the destination.

Running evals manually. Manual eval runs happen once, get discussed in a meeting, and are forgotten. Automated runs happen every day and block bad deployments. Automate from day one.

Skipping the baseline. Without a baseline, every eval run is an isolated data point. You cannot detect regressions without knowing where you started. Run, record, and preserve your day-zero scores.

Using the same model as judge and agent. Same-model judging introduces systematic blind spots. The judge shares the agent's biases and will rate its mistakes as acceptable. Use a different model family for judging.

The Evaluation Stack for Production AI Agents

The tooling ecosystem for agent evaluation has matured rapidly. Here is how the major platforms map to the three evaluation layers.

| Platform | Unit Testing | Trajectory Testing | Outcome Grading | CI/CD Integration | Pricing Model |
| --- | --- | --- | --- | --- | --- |
| Braintrust | Strong | Strong | Strong (built-in judges) | Native | Usage-based |
| Langfuse | Strong | Good (trace analysis) | Good | Native | Open-source core |
| Galileo | Good | Strong | Strong (guardrail metrics) | Good | Enterprise |
| Evidently AI | Good | Moderate | Strong (drift detection) | Native | Open-source core |
| Maxim | Good | Good | Strong | Good | Usage-based |

No single platform covers everything perfectly. Most teams at 1Raft combine a tracing platform (Langfuse or Braintrust) with custom eval scripts for domain-specific grading criteria. The platform handles infrastructure. The custom scripts handle business logic.

"Two weeks to build a proper eval pipeline feels like overhead. But the alternative is flying blind after every model update, every prompt change, and every new integration. Teams that skip evals don't save two weeks - they spend six months firefighting issues they could have caught in CI." - 1Raft Engineering Team

The Bottom Line

AI agent testing is not an optional nicety. It is the difference between agents that work in demos and agents that work in production. The 52% eval adoption rate explains the 32% quality barrier. Teams that invest in evaluation catch regressions before users do, make model upgrade decisions with data instead of intuition, and build agents that improve systematically over time. The three-layer framework (unit, trajectory, outcome), combined with a failure-derived task dataset and LLM-as-judge grading, gives you a testing foundation that scales. Build it in two weeks. Expand it continuously. Every production failure your agent encounters becomes a test case that prevents the same failure from happening again.

About 1Raft

1Raft has shipped 100+ AI products and builds eval pipelines into every agent deployment from day one. We test at three levels - tool calls, trajectories, and outcomes - so regressions are caught before users see them. Our 12-week delivery framework includes eval infrastructure as a core deliverable, not an afterthought.
