Eval Harness: Stop Judging Agent Changes by Vibes
Agent behavior is fragile. You change one prompt, one tool description, or one guardrail, and the system may behave differently.
An eval harness answers one question:
Given a fixed task set, did the agent's behavior regress?
It is not mainly about producing a pretty model score. It is about catching “this worked yesterday and broke today.”
What It Fixes
Common travel-assistant regressions:
- It used to ask for budget; now it gives an expensive route.
- It used to check weather; now it guesses.
- A new guardrail blocks normal questions.
- A routing change sends visa questions into the food workflow.
These are hard to catch by feel. You need a fixed task set that runs after every change.
Flow
flowchart TD
T["Fixed task set"] --> R["Run agent"]
R --> O["Save output and trace"]
O --> S["Score by rules"]
S --> B["Compare with baseline"]
B --> D["Detect regression or confirm improvement"]
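The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `id` field on each task, the function names `run_tasks` and `find_regressions`, and the two stand-in agents are all assumptions made for the example. Trace saving is left out here to keep the comparison step visible.

```python
from typing import Callable

Scores = dict[str, bool]


def run_tasks(
    tasks: list[dict],
    agent: Callable[[str], str],
    scorer: Callable[[str], Scores],
) -> dict[str, dict]:
    """Run every task and keep the raw output next to its scores."""
    results = {}
    for task in tasks:
        output = agent(task["input"])
        results[task["id"]] = {"output": output, "scores": scorer(output)}
    return results


def find_regressions(
    baseline: dict[str, dict], current: dict[str, dict]
) -> list[tuple[str, str]]:
    """Return (task_id, check) pairs that passed in the baseline but fail now."""
    regressions = []
    for task_id, old in baseline.items():
        new_scores = current.get(task_id, {}).get("scores", {})
        for check, passed in old["scores"].items():
            if passed and not new_scores.get(check, False):
                regressions.append((task_id, check))
    return regressions


# Stand-in agents for illustration only; a real harness would call the actual agent.
def old_agent(prompt: str) -> str:
    return "Budget 300 RMB: West Lake loop, easy walking."


def new_agent(prompt: str) -> str:
    return "Take a private car tour of West Lake, easy walking."


def budget_scorer(output: str) -> Scores:
    return {
        "mentions_budget": "300" in output,
        "mentions_easy_walking": "easy" in output,
    }


tasks = [
    {
        "id": "hangzhou-1",  # hypothetical task id, used as the comparison key
        "input": "Plan a one-day Hangzhou trip under 300 RMB, with easy walking.",
    }
]
baseline = run_tasks(tasks, old_agent, budget_scorer)
current = run_tasks(tasks, new_agent, budget_scorer)
print(find_regressions(baseline, current))  # [('hangzhou-1', 'mentions_budget')]
```

Comparing check by check, rather than comparing one aggregate score, tells you which behavior broke and on which task.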
What a Task Looks Like
task = {
    "input": "Plan a one-day Hangzhou trip under 300 RMB, with easy walking.",
    "checks": [
        "asks_or_respects_budget",
        "does_not_claim_live_weather_without_tool",
        "includes_easy_walking_constraint",
    ],
}
A first scorer can be simple:
def score(output: str) -> dict:
    return {
        "mentions_budget": "300" in output,
        "mentions_easy_walking": "easy" in output or "not too tiring" in output,
        "no_fake_weather": "sunny tomorrow" not in output,
    }
Real systems can use rules, human review, LLM-as-judge, or a mix. The important part is fixed criteria, saved output, and saved trace.
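One way to keep criteria fixed is to resolve the check names listed in a task to rule functions through a registry. The sketch below assumes a hypothetical `CHECKS` table and `score_task` helper; the keyword matching is deliberately crude and only stands in for whatever rules, rubric, or judge a real system would use.

```python
from typing import Callable

Check = Callable[[str], bool]

# Hypothetical rule implementations keyed by the check names used in task definitions.
CHECKS: dict[str, Check] = {
    "asks_or_respects_budget": lambda out: "300" in out or "budget" in out.lower(),
    "does_not_claim_live_weather_without_tool": lambda out: "sunny tomorrow" not in out.lower(),
    "includes_easy_walking_constraint": lambda out: (
        "easy" in out.lower() or "not too tiring" in out.lower()
    ),
}


def score_task(task: dict, output: str) -> dict[str, bool]:
    """Apply only the checks the task lists; fail loudly on unknown names."""
    unknown = [name for name in task["checks"] if name not in CHECKS]
    if unknown:
        raise KeyError(f"no rule registered for: {unknown}")
    return {name: CHECKS[name](output) for name in task["checks"]}


task = {
    "input": "Plan a one-day Hangzhou trip under 300 RMB, with easy walking.",
    "checks": ["asks_or_respects_budget", "includes_easy_walking_constraint"],
}
result = score_task(task, "A 280 RMB plan with flat, easy paths. What is your exact budget?")
print(result)
```

Raising on an unknown check name means a typo in a task definition surfaces as an error instead of a silently skipped check.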
Use It When
- You change prompts, tool descriptions, routing rules, or guardrails.
- You compare two patterns, such as Workflow vs. ReAct.
- You are preparing to deploy and need regression tasks to run first.
- You want to know whether a refactor broke old behavior.
Avoid It When
For quick exploration, manually run a few examples first. An eval harness has a maintenance cost: task sets, scoring rules, and baselines all need upkeep.
Once you iterate on the agent repeatedly, it pays for itself quickly.
Common Failure Modes
| Mistake | Result | Fix |
|---|---|---|
| Too few tasks | Regressions hide | Cover common, edge, and high-risk inputs |
| Only checking final answers | You cannot locate the failure | Save trace: route, tools, observations, stop reason |
| Vague scoring | Every run becomes an argument | Write executable rules or a clear rubric |
| Only online evals | Runs are slow and results are unstable | Keep an offline task set separate from online evals |
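The "save trace" fix in the table can be sketched as a small record type plus a check that reads the trace instead of only the final answer. The `Trace` fields mirror the four items named in the table. The tool name `get_weather`, the route name `trip_planner`, and the weather keywords are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    route: str                                             # which workflow handled the request
    tool_calls: list[str] = field(default_factory=list)    # tool names, in call order
    observations: list[str] = field(default_factory=list)  # raw tool results
    stop_reason: str = ""                                  # e.g. "final_answer", "max_steps"


def weather_claim_is_grounded(output: str, trace: Trace) -> bool:
    """Pass unless the answer talks about weather with no weather tool call behind it."""
    mentions_weather = any(w in output.lower() for w in ("sunny", "rain", "forecast"))
    return (not mentions_weather) or "get_weather" in trace.tool_calls


def to_record(task_id: str, output: str, trace: Trace) -> str:
    """Serialize one run as a JSON line so a failure can be inspected later."""
    return json.dumps({"task_id": task_id, "output": output, "trace": asdict(trace)})


guessed = Trace(route="trip_planner", stop_reason="final_answer")
checked = Trace(
    route="trip_planner",
    tool_calls=["get_weather"],
    observations=["light rain"],
    stop_reason="final_answer",
)

print(weather_claim_is_grounded("It will be sunny tomorrow.", guessed))  # False
print(weather_claim_is_grounded("Light rain is forecast; bring an umbrella.", checked))  # True
print(to_record("hangzhou-1", "It will be sunny tomorrow.", guessed))
```

With the trace stored next to the output, a failed check points at a concrete cause, such as a missing tool call or a wrong route, rather than just a bad final answer.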
Where It Fits
An eval harness is not the first thing to build. It usually comes after you have:
- tool calling
- a workflow or agent loop
- policy / guardrails
- a task set you expect to revisit
Earlier chapters build structure. An eval harness lets you check changes to that structure for regressions.