# Eval Harness (Regression Tests for Agents)

## What Problem It Solves
Agents are non-deterministic programs: small changes to prompts, tools, or policies can silently break behavior.

An eval harness provides:
- a fixed set of tasks (offline-first)
- repeatable scoring (pass/fail + metrics)
- trace artifacts for debugging regressions
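A fixed task set with repeatable scoring can be sketched in a few lines. The names here (`EvalTask`, `score`, `TASKS`) are illustrative, not the repo's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalTask:
    """One fixed task: an input prompt plus a pass/fail checker for the output."""
    task_id: str
    prompt: str
    check: Callable[[str], bool]

# A small offline task set: defining it requires no model or network access.
TASKS = [
    EvalTask("math-add", "What is 2 + 3?", lambda out: "5" in out),
    EvalTask("greeting", "Say hello politely.", lambda out: "hello" in out.lower()),
]

def score(agent: Callable[[str], str], tasks: list[EvalTask]) -> dict[str, bool]:
    """Run every task through the agent and record pass/fail per task id."""
    return {t.task_id: t.check(agent(t.prompt)) for t in tasks}
```

Because the checkers are plain predicates over the final answer, the same task set scores an offline stub and a real SDK-backed agent identically.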
## When to Use
- You ship agents and want "CI for agent behavior".
- You add new patterns, tools, or guardrails and need confidence.
- You want to compare variants (e.g., ReAct vs. Plan & Solve) on the same task set.
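Comparing variants on the same task set reduces to running each agent over identical tasks and tabulating pass rates. A minimal sketch, assuming tasks are `(task_id, prompt, check)` triples (the `compare_variants` helper is hypothetical):

```python
from typing import Callable

# A task is (task_id, prompt, check); check is a pass/fail predicate.
Task = tuple[str, str, Callable[[str], bool]]

def compare_variants(variants: dict[str, Callable[[str], str]],
                     tasks: list[Task]) -> dict[str, float]:
    """Run each agent variant over the same fixed task set; return pass rates."""
    rates = {}
    for name, agent in variants.items():
        passed = sum(check(agent(prompt)) for _, prompt, check in tasks)
        rates[name] = passed / len(tasks)
    return rates
```

Holding the task set fixed is what makes the comparison meaningful: any difference in pass rate is attributable to the variant, not the tasks.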
## Core Flow

```mermaid
flowchart TD
    T["Select tasks"] --> R["Run runner (offline model or real SDK)"]
    R --> S["Score + aggregate metrics"]
    S --> O["Write report + traces"]
    O --> C["Compare with baseline"]
```
## Repo Reference
- CLI: `src/agent_patterns_lab/runtime/evals/__main__.py`
- Tasks: `src/agent_patterns_lab/runtime/evals/tasks.py`
- Runner: `src/agent_patterns_lab/runtime/evals/runner.py`
- Report: `src/agent_patterns_lab/runtime/evals/report.py`
- Tests: `tests/test_evals_runner.py`