# Skill Evals (evals.json)
## Overview

Agent Skills is an open standard for describing AI agent capabilities. Its `evals.json` format defines simple test cases for skills — a prompt, expected output, and natural-language assertions.
AgentV natively supports evals.json. You can run Agent Skills evals directly:
```sh
agentv eval evals.json --target claude
```

When you need AgentV’s power features (deterministic evaluators, composite scoring, multi-turn conversations, workspace isolation), you can graduate to EVAL.yaml.
## Quick start

Create `evals.json`:
```json
{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in evals/files/sales.csv. Find the top 3 months by revenue.",
      "expected_output": "The top 3 months by revenue are November ($22,500), September ($20,100), and December ($19,400).",
      "files": ["evals/files/sales.csv"],
      "assertions": [
        "Output identifies November as the highest revenue month",
        "Output includes exactly 3 months",
        "Revenue figures are included for each month"
      ]
    }
  ]
}
```

Run it:
```sh
agentv eval evals.json --target claude
```

The `--target` flag selects the agent harness. The agent evaluates itself — skills load naturally via progressive disclosure.
## Field mapping

When AgentV loads `evals.json`, it promotes fields to its internal representation:
| evals.json | EVAL.yaml equivalent | Notes |
|---|---|---|
| `prompt` | `input` | Wrapped as `[{role: "user", content: prompt}]` |
| `expected_output` | `expected_output` + `criteria` | Used as reference answer and evaluation criteria |
| `assertions[]` | `assert[]` | Each string becomes `{type: llm-grader, prompt: text}` |
| `files[]` | `file_paths` | Resolved relative to `evals.json`, copied into workspace |
| `skill_name` | `metadata.skill_name` | Carried as metadata |
| `id` (number) | `id` (string) | Converted via `String(id)` |
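As a rough illustration, the promotion in the table can be sketched in Python. This is not AgentV's actual code; the `promote_case` helper and the exact internal key names are assumptions for illustration only.

```python
def promote_case(case, skill_name):
    """Promote one evals.json case to an EVAL.yaml-style test (illustrative sketch)."""
    return {
        "id": str(case["id"]),  # id (number) -> id (string)
        "input": [{"role": "user", "content": case["prompt"]}],
        "expected_output": case.get("expected_output"),
        "criteria": case.get("expected_output"),  # reused as evaluation criteria
        "file_paths": case.get("files", []),      # resolved relative to evals.json later
        "assert": [
            {"type": "llm-grader", "prompt": text}
            for text in case.get("assertions", [])
        ],
        "metadata": {"skill_name": skill_name},
    }

case = {
    "id": 1,
    "prompt": "Find the top 3 months by revenue.",
    "assertions": ["Output includes exactly 3 months"],
}
test = promote_case(case, "csv-analyzer")
print(test["id"])  # prints: 1
```

Each natural-language assertion string becomes a separate LLM-graded check, which is why longer assertion lists cost more to evaluate.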
## Files support

The `files[]` field lists files that the agent needs during evaluation. Paths are relative to the `evals.json` location:
```json
{
  "evals": [
    {
      "id": 1,
      "prompt": "Analyze the sales data",
      "files": ["evals/files/sales.csv", "evals/files/config.json"]
    }
  ]
}
```

AgentV resolves these paths and copies the files into the workspace before the agent runs. If a file is missing, the test case fails with a `file_copy_error`.
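A sketch of that resolution-and-copy step is below. The helper name and the mapping of a missing file to `file_copy_error` are assumptions; AgentV's actual implementation is not shown here.

```python
import shutil
import tempfile
from pathlib import Path

def copy_files_into_workspace(evals_json_path, files, workspace):
    """Resolve files[] relative to the evals.json location and copy them in.

    Sketch only: raising FileNotFoundError stands in for AgentV's
    file_copy_error test-case failure.
    """
    base = Path(evals_json_path).resolve().parent
    for rel in files:
        src = base / rel
        if not src.is_file():
            raise FileNotFoundError(f"file_copy_error: {src}")
        dest = Path(workspace) / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)

# Demo: a throwaway directory standing in for the skill repo.
with tempfile.TemporaryDirectory() as repo:
    data = Path(repo) / "evals" / "files"
    data.mkdir(parents=True)
    (data / "sales.csv").write_text("month,revenue\nNov,22500\n")
    workspace = Path(repo) / "workspace"
    copy_files_into_workspace(Path(repo) / "evals.json", ["evals/files/sales.csv"], workspace)
    copied = (workspace / "evals" / "files" / "sales.csv").read_text()
```

Note that the relative path is preserved inside the workspace, so prompts that reference `evals/files/sales.csv` keep working.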
## Agent mode (no API keys)

AgentV’s prompt subcommands work with `evals.json`, enabling agent-mode evaluation without API keys:
```sh
# List test IDs
agentv prompt eval --list evals.json

# Get input for a specific test
agentv prompt eval --input evals.json --test-id 1

# Get expected output and evaluator criteria for a test
agentv prompt eval --expected-output evals.json --test-id 1
```

In agent mode, the host agent uses these accessors to enumerate tests, act as the candidate, and then grade the saved answer against the eval spec.
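What the first two accessors return can be approximated by reading `evals.json` directly. The Python sketch below mimics that behavior for illustration; it is not how `agentv` implements them.

```python
import json
import tempfile
from pathlib import Path

def list_test_ids(evals_path):
    """Roughly what `agentv prompt eval --list` exposes: the test IDs."""
    spec = json.loads(Path(evals_path).read_text())
    return [str(case["id"]) for case in spec["evals"]]

def get_input(evals_path, test_id):
    """Roughly what `--input` exposes: the prompt for one test."""
    spec = json.loads(Path(evals_path).read_text())
    for case in spec["evals"]:
        if str(case["id"]) == str(test_id):
            return case["prompt"]
    raise KeyError(f"no test with id {test_id}")

# Demo against a minimal evals.json written to a temp directory.
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "evals.json"
    path.write_text(json.dumps({
        "skill_name": "demo",
        "evals": [{"id": 1, "prompt": "Say hello.", "expected_output": "hello"}],
    }))
    ids = list_test_ids(path)      # ['1']
    prompt = get_input(path, "1")  # 'Say hello.'
```

Numeric IDs are stringified on the way out, matching the `id` conversion described in the field-mapping table.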
If you’re using the agentv-bench skill bundle, validate your evals before running:
```sh
cd plugins/agentv-dev/skills/agentv-bench
python scripts/quick_validate.py --eval evals/evals.json
```

The rest of the bundle follows the same pattern:
- `scripts/run_eval.py` runs evals via `claude -p`
- `scripts/run_loop.py` iterates eval rounds automatically
- `scripts/aggregate_benchmark.py` and `scripts/generate_report.py` read AgentV artifacts
- `scripts/improve_description.py` proposes description experiments from observed misses
## Benchmark output

Generate an Agent Skills compatible `benchmark.json` alongside the standard result JSONL:
```sh
agentv eval evals.json --target claude --benchmark-json benchmark.json
```

The benchmark uses AgentV’s pass threshold (score >= 0.8) to map continuous scores to the binary pass/fail that Agent Skills `pass_rate` expects:
```json
{
  "run_summary": {
    "with_skill": {
      "pass_rate": {"mean": 0.83, "stddev": 0.06},
      "time_seconds": {"mean": 45.0, "stddev": 12.0},
      "tokens": {"mean": 3800, "stddev": 400}
    }
  }
}
```
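The threshold mapping behind `pass_rate` can be sketched as follows. The aggregation details (for instance, population versus sample stddev) are assumptions here, not AgentV's documented behavior.

```python
from statistics import mean, pstdev

PASS_THRESHOLD = 0.8  # AgentV's pass threshold, as stated above

def pass_rate(scores):
    """Map continuous scores to binary pass/fail, then aggregate.

    Sketch only: population stddev (pstdev) is an assumption.
    """
    passes = [1.0 if s >= PASS_THRESHOLD else 0.0 for s in scores]
    return {"mean": round(mean(passes), 2), "stddev": round(pstdev(passes), 2)}

rate = pass_rate([0.95, 0.85, 0.6])  # one score falls below the threshold
```

A score of exactly 0.8 counts as a pass under `>=`, so borderline LLM-graded scores land on the passing side.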
## Converting to EVAL.yaml

When you’re ready to graduate, convert your `evals.json` to EVAL.yaml:
```sh
# Output to stdout
agentv convert evals.json

# Write to file
agentv convert evals.json -o eval.yaml
```

The generated YAML includes comments about available AgentV features you can use:
```yaml
# Converted from Agent Skills evals.json
# AgentV features you can add:
# - type: is_json, contains, regex for deterministic evaluators
# - type: code-grader for custom scoring scripts
# - Multi-turn conversations via input message arrays
# - Composite evaluators with weighted scoring
# - Workspace isolation with repos and hooks

tests:
  - id: "1"
    criteria: |-
      The top 3 months by revenue are November, September, and December.
    input:
      - role: user
        content: "Find the top 3 months by revenue."
    # Promoted from evals.json assertions[]
    # Replace with type: is_json, contains, or regex for deterministic checks
    assertions:
      - name: assertion-1
        type: llm-grader
        prompt: "Output identifies November as the highest revenue month"
```

Inside the agentv-bench bundle, use `agentv convert` directly:
```sh
agentv convert evals/evals.json --out EVAL.yaml
```
## When to stay with evals.json

Use `evals.json` when:
- You’re building a skill and want quick feedback loops
- Your assertions are natural-language (“output includes a chart”, “response is polite”)
- You want compatibility with other Agent Skills tooling
- Tests don’t need workspace isolation or deterministic checks
## When to graduate to EVAL.yaml

Switch to EVAL.yaml when you need:
- Deterministic evaluators: `contains`, `regex`, `equals`, `is-json` — faster and cheaper than LLM graders
- Composite scoring: Weighted evaluators with custom aggregation
- Multi-turn conversations: Multi-message input sequences
- Workspace isolation: Sandboxed file systems per test case
- Tool trajectory evaluation: Assert on the sequence of tool calls
- Matrix evaluation: Test across multiple targets simultaneously
## Side-by-side comparison

The same eval expressed in both formats:
### evals.json

```json
{
  "skill_name": "support-agent",
  "evals": [
    {
      "id": 1,
      "prompt": "A customer says their order #12345 hasn't arrived after 2 weeks. Help them.",
      "expected_output": "An empathetic response that offers to track the order and provides next steps.",
      "assertions": [
        "Response acknowledges the customer's frustration",
        "Response offers to look up order #12345",
        "Response provides clear next steps"
      ]
    }
  ]
}
```
### EVAL.yaml equivalent

```yaml
tests:
  - id: "1"
    input: |
      A customer says their order #12345 hasn't arrived after 2 weeks. Help them.
    expected_output: |
      An empathetic response that offers to track the order and provides next steps.
    assertions:
      - name: acknowledges-frustration
        type: llm-grader
        prompt: Response acknowledges the customer's frustration
      - name: looks-up-order
        type: contains
        value: "12345"
      - name: has-next-steps
        type: llm-grader
        prompt: Response provides clear next steps
```

Notice how the EVAL.yaml version can mix `llm-grader` (for subjective checks) with `contains` (for deterministic checks) — the order number check is now instant and free.
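For intuition, a `contains` assertion such as the order-number check reduces to a plain substring test. This sketch shows only the idea; the real evaluator's scoring interface is an assumption here.

```python
def contains_check(output: str, value: str) -> bool:
    """Deterministic `contains` assertion: a plain substring test, no LLM call."""
    return value in output

reply = "So sorry for the delay! Let me look up order #12345 and get you tracking info."
ok = contains_check(reply, "12345")  # True
```

Because it is exact substring matching, formatting variants (e.g. "order 12 345") would fail the check; use `regex` when you need tolerance for such variation.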