Eval Harness: Stop Judging Agent Changes by Vibes

Agent behavior is fragile. Change one prompt, one tool description, or one guardrail, and the system may behave differently on inputs you never meant to affect.

An eval harness answers:

Given a fixed task set, did the agent's behavior regress?

It is not mainly about producing a pretty model score. It is about catching “this worked yesterday and broke today.”

What It Fixes

Common travel-assistant regressions:

It used to ask for budget; now it gives an expensive route.
It used to check weather; now it guesses.
A new guardrail blocks normal questions.
A routing change sends visa questions into the food workflow.

These are hard to catch by feel. You need a fixed task set that runs after every change.

Flow

flowchart TD
  T["Fixed task set"] --> R["Run agent"]
  R --> O["Save output and trace"]
  O --> S["Score by rules"]
  S --> B["Compare with baseline"]
  B --> D["Detect regression or confirm improvement"]
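
The flow above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `run_eval`, `compare_with_baseline`, the `"id"` field, the `respects_budget` check, and the two stand-in agents are all hypothetical names made up for this example, and the "save output and trace" step is left out to keep it short.

```python
from typing import Callable

# Assumed shapes for this sketch: a task is a dict with "id" and "input";
# an agent is any callable from input text to output text.
Agent = Callable[[str], str]
Scorer = Callable[[str], dict[str, bool]]
Results = dict[str, dict[str, bool]]


def run_eval(tasks: list[dict], agent: Agent, scorer: Scorer) -> Results:
    """Run every task once and return {task_id: {check_name: passed}}."""
    return {task["id"]: scorer(agent(task["input"])) for task in tasks}


def compare_with_baseline(baseline: Results, current: Results):
    """Split per-check differences into regressions and improvements."""
    regressions, improvements = [], []
    for task_id, checks in current.items():
        for name, passed in checks.items():
            before = baseline.get(task_id, {}).get(name)
            if before is True and passed is False:
                regressions.append((task_id, name))
            elif before is False and passed is True:
                improvements.append((task_id, name))
    return regressions, improvements


tasks = [{"id": "hangzhou-budget", "input": "Plan a one-day Hangzhou trip under 300 RMB."}]


def scorer(output: str) -> dict[str, bool]:
    # Crude keyword rule, good enough for a demonstration.
    return {"respects_budget": "300" in output}


def old_agent(text: str) -> str:  # stand-in for the agent before a change
    return "Here is a route that stays under 300 RMB."


def new_agent(text: str) -> str:  # stand-in for the agent after a prompt change
    return "Here is a premium route with a private driver."


baseline = run_eval(tasks, old_agent, scorer)
current = run_eval(tasks, new_agent, scorer)
regressions, improvements = compare_with_baseline(baseline, current)
print(regressions)  # [('hangzhou-budget', 'respects_budget')]
```

Comparing per check rather than per task keeps the report specific: it names which behavior broke, not just which task failed.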

What a Task Looks Like

task = {
    "input": "Plan a one-day Hangzhou trip under 300 RMB, with easy walking.",
    "checks": [
        "asks_or_respects_budget",
        "does_not_claim_live_weather_without_tool",
        "includes_easy_walking_constraint",
    ],
}

A first scorer can be simple:

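def score(output: str) -> dict[str, bool]:
    """Keyword checks for the task above; keys match its "checks" list."""
    text = output.lower()  # compare case-insensitively
    return {
        # Crude proxy: the answer repeats the 300 RMB budget.
        "asks_or_respects_budget": "300" in text,
        "includes_easy_walking_constraint": "easy" in text or "not too tiring" in text,
        # Only catches one known fabricated phrase.
        "does_not_claim_live_weather_without_tool": "sunny tomorrow" not in text,
    }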

Real systems can use rules, human review, LLM-as-judge, or a mix. What matters is that the criteria stay fixed and that every run's output and trace are saved.
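
One way to keep criteria fixed is to give each check a name and look it up in a registry, so a task's `"checks"` list decides which rules run. The sketch below assumes that design; `CHECKS` and `score_task` are hypothetical names, and the keyword heuristics are placeholders for whatever rules or judges a real system would use.

```python
from typing import Callable

Check = Callable[[str], bool]

# Hypothetical rule implementations keyed by the check names used in the task.
CHECKS: dict[str, Check] = {
    "asks_or_respects_budget": lambda out: "300" in out or "budget" in out.lower(),
    "does_not_claim_live_weather_without_tool": lambda out: "sunny tomorrow" not in out.lower(),
    "includes_easy_walking_constraint": lambda out: "easy" in out.lower(),
}


def score_task(task: dict, output: str) -> dict[str, bool]:
    """Run only the checks the task lists; fail loudly on an unknown name."""
    unknown = [name for name in task["checks"] if name not in CHECKS]
    if unknown:
        raise KeyError(f"no rule registered for: {unknown}")
    return {name: CHECKS[name](output) for name in task["checks"]}


task = {
    "input": "Plan a one-day Hangzhou trip under 300 RMB, with easy walking.",
    "checks": ["asks_or_respects_budget", "includes_easy_walking_constraint"],
}
result = score_task(task, "A flat, easy lakeside loop, well within the 300 RMB budget.")
print(result)
```

Raising on an unknown check name is a deliberate choice here: a typo in a task file should stop the run instead of silently skipping a criterion.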

Use It When

You change prompts, tool descriptions, routing rules, or guardrails.
You compare two patterns, such as Workflow vs. ReAct.
You plan to deploy and need regression tasks.
You want to know whether a refactor broke old behavior.
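
For the comparison case, the same per-check results can be collapsed into one number per variant. A minimal sketch, assuming results are stored as `{task_id: {check_name: passed}}`; `pass_rate` is a hypothetical helper, and the two result sets are invented purely to show the arithmetic.

```python
def pass_rate(results: dict[str, dict[str, bool]]) -> float:
    """Fraction of all checks that passed across all tasks."""
    flags = [passed for checks in results.values() for passed in checks.values()]
    return sum(flags) / len(flags) if flags else 0.0


# Invented results for two variants run on the same two tasks.
workflow = {
    "t1": {"budget": True, "walking": True},
    "t2": {"budget": True, "walking": False},
}
react = {
    "t1": {"budget": True, "walking": False},
    "t2": {"budget": False, "walking": False},
}

print(pass_rate(workflow), pass_rate(react))  # 0.75 0.25
```

A single rate is only a summary. Two variants with the same rate can fail different checks, so the per-check results are still worth keeping.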

Avoid It When

For quick exploration, run a few examples by hand first. An eval harness has a maintenance cost: task sets, scoring rules, and baselines all need upkeep.

Once you iterate on the agent repeatedly, it pays for itself quickly.

Common Failure Modes

Mistake	Result	Fix
Too few tasks	Regressions hide	Cover common, edge, and high-risk inputs
Only checking final answers	You cannot locate the failure	Save trace: route, tools, observations, stop reason
Vague scoring	Every run becomes an argument	Write executable rules or a clear rubric
Only online evals	Slow and unstable	Separate offline tasks from online tasks
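
The "save trace" fix in the table can be made concrete with a small record type. This is a sketch under assumptions: the `Trace` class, its field names, the `visa_lookup` tool name, and the `"final_answer"` stop reason are all hypothetical, chosen to mirror the route, tools, observations, and stop reason listed above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Trace:
    """Hypothetical trace record for one task run."""
    task_id: str
    route: str  # which workflow handled the request
    tools: list[str] = field(default_factory=list)  # tool names, in call order
    observations: list[str] = field(default_factory=list)  # raw tool results
    stop_reason: str = ""  # e.g. "final_answer" or "max_steps"
    output: str = ""


def trace_checks(trace: Trace, expected_route: str, required_tool: str) -> dict[str, bool]:
    """Checks that read the trace instead of the final answer."""
    return {
        "routed_correctly": trace.route == expected_route,
        "called_required_tool": required_tool in trace.tools,
        "stopped_normally": trace.stop_reason == "final_answer",
    }


# A visa question that was routed into the food workflow.
trace = Trace(
    task_id="visa-question",
    route="food",
    tools=[],
    stop_reason="final_answer",
    output="Try the noodles near West Lake.",
)

# One JSON object per line keeps saved runs easy to diff and search.
line = json.dumps(asdict(trace), ensure_ascii=False)
result = trace_checks(trace, expected_route="visa", required_tool="visa_lookup")
print(result)
```

With the trace saved, the failure is located at the routing step: the answer looks plausible on its own, but `routed_correctly` is false.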

Where It Fits

An eval harness is not the first thing to build. It usually comes after you have:

tool calling
a workflow or agent loop
policy / guardrails
a task set you expect to revisit

Earlier chapters build structure. An eval harness makes changes to that structure checkable for regressions.