Skill Improvement Workflow

AgentV supports a full evaluation-driven improvement loop for skills and agents. Instead of guessing whether a change makes things better, you run structured evaluations before and after, then compare.

This guide teaches the core manual loop. For automated iteration that runs the full cycle hands-free, see agentv-bench.

Every skill improvement follows the same cycle:

```
┌─────────────────┐
│ Write Scenarios │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Run Baseline   │◄──────────────────┐
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│  Run Candidate  │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│     Compare     │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│ Review Failures │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│  Improve Skill  │────── Re-run ─────┘
└─────────────────┘
```
  1. Write test scenarios that capture what the skill should do
  2. Run a baseline evaluation without the skill (or with the previous version)
  3. Run a candidate evaluation with the new or updated skill
  4. Compare the two runs to see what improved and what regressed
  5. Review failures to understand why specific cases failed
  6. Improve the skill based on failure analysis
  7. Re-run and iterate until the candidate consistently beats the baseline
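
The numbered cycle above can be sketched as a control-flow loop. This is purely illustrative: `runEval` and `improve` are stand-ins for the `agentv eval` invocations and manual skill edits described below, not real AgentV APIs.

```typescript
// Schematic of the improvement cycle. `runEval` stands in for an
// `agentv eval` run that yields an overall score; `improve` stands in
// for the manual skill edits of step 6. Both are illustrative assumptions.
function improvementLoop(
  runEval: (target: "baseline" | "candidate") => number,
  improve: () => void,
  maxIterations = 5,
): number {
  const baseline = runEval("baseline");   // step 2: establish the baseline once
  let candidate = runEval("candidate");   // step 3: first candidate run
  for (let i = 0; i < maxIterations && candidate <= baseline; i++) {
    improve();                            // step 6: apply targeted fixes
    candidate = runEval("candidate");     // step 7: re-run and iterate
  }
  return Math.max(candidate, baseline);
}
```

In practice, steps 4 and 5 (compare and review failures) happen between `improve` iterations; the sketch only captures the stopping condition.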

Start with evals.json for quick iteration. It’s the simplest format and works directly with AgentV — no conversion needed.

```json
{
  "skill_name": "code-reviewer",
  "evals": [
    {
      "id": 1,
      "prompt": "Review this Python function for bugs:\n\ndef divide(a, b):\n    return a / b",
      "expected_output": "The function should handle division by zero.",
      "assertions": [
        "Identifies the division by zero risk",
        "Suggests adding error handling"
      ]
    },
    {
      "id": 2,
      "prompt": "Review this function:\n\ndef greet(name):\n    return f'Hello, {name}!'",
      "expected_output": "The function is simple and correct.",
      "assertions": [
        "Does not flag false issues",
        "Acknowledges the function is straightforward"
      ]
    }
  ]
}
```

For assisted authoring, use the agentv-eval-builder skill — it knows the full eval file schema and can generate test cases from descriptions.

Run the evaluation without the skill loaded to establish a baseline:

```sh
agentv eval evals.json --target baseline
```

This produces a results file (e.g., results-baseline.jsonl) showing how the agent performs on its own.
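
A results file is newline-delimited JSON, one record per test case. As a rough sketch of summarizing a run (the field names `id`, `score`, and `passed` are assumptions for illustration, not AgentV's documented schema — inspect your actual output for the real fields):

```typescript
// Sketch of summarizing a results-*.jsonl file. The record shape is an
// assumption; adjust to match the fields your AgentV version emits.
interface EvalResult {
  id: string;
  score: number;   // assumed 0..1
  passed: boolean;
}

function parseResults(jsonl: string): EvalResult[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank lines
    .map((line) => JSON.parse(line) as EvalResult);
}

function passRate(results: EvalResult[]): number {
  return results.length === 0
    ? 0
    : results.filter((r) => r.passed).length / results.length;
}

const sample = [
  '{"id":"1","score":0.4,"passed":false}',
  '{"id":"2","score":0.9,"passed":true}',
].join("\n");

console.log(passRate(parseResults(sample))); // 0.5
```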

Skills in .claude/skills/ are auto-loaded by progressive disclosure. This means your baseline may accidentally include the skill you’re testing.

Workaround: Develop skills outside discovery paths during the evaluation cycle. Keep your skill-in-progress in a working directory (e.g., drafts/) and only move it to .claude/skills/ when you’re satisfied with the evaluation results.

```sh
# Skill lives outside the discovery path during development:
#   drafts/
#     my-skill/
#       SKILL.md

# Baseline run won't pick it up
agentv eval evals.json --target baseline
```

Run the same evaluation with the skill loaded:

```sh
agentv eval evals.json --target candidate
```

Or if using agent mode (no API keys required):

```sh
# List available test cases
agentv prompt eval --list evals.json

# Get the input prompt for a test case
agentv prompt eval --input evals.json --test-id 1

# After running the agent, fetch expected output and evaluator criteria
agentv prompt eval --expected-output evals.json --test-id 1
```

Agent mode is useful when you want to evaluate skills with agents that don’t have a direct API integration — you orchestrate the run yourself and use AgentV’s accessor commands to read the eval spec.

If you’re using the agentv-bench skill bundle, the equivalent wrappers are:

```sh
cd plugins/agentv-dev/skills/agentv-bench
bun install
bun scripts/quick-validate.ts --scope wrappers
bun scripts/prompt-eval.ts --list evals.json
bun scripts/prompt-eval.ts --input evals.json --test-id 1
bun scripts/prompt-eval.ts --expected-output evals.json --test-id 1
```

Compare the baseline and candidate runs:

```sh
agentv compare results-baseline.jsonl results-candidate.jsonl
```

The comparison output shows:

  • Per-test score deltas — which cases improved, regressed, or stayed the same
  • Aggregate statistics — overall pass rate change, mean score shift
  • Regressions — cases that were passing before but now fail (these need immediate attention)

Look for:

  • Net positive delta — more cases improved than regressed
  • ⚠️ Any regressions — even one regression deserves investigation
  • 📊 Score distribution — are improvements concentrated or spread across cases?
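
As a sketch of the per-test delta logic, comparison pairs up baseline and candidate scores by test id and flags any score drop as a regression. The shapes below are illustrative assumptions, not `agentv compare`'s actual output format:

```typescript
// Illustrative per-test comparison: join runs on test id, compute the
// score delta, and flag regressions. Record shapes are assumed.
interface TestScore { id: string; score: number; }
interface Delta { id: string; delta: number; regression: boolean; }

function compareRuns(baseline: TestScore[], candidate: TestScore[]): Delta[] {
  const base = new Map(baseline.map((t) => [t.id, t.score]));
  return candidate.map((t) => {
    const before = base.get(t.id) ?? 0;
    return { id: t.id, delta: t.score - before, regression: t.score < before };
  });
}

const deltas = compareRuns(
  [{ id: "1", score: 0.8 }, { id: "2", score: 0.5 }],
  [{ id: "1", score: 0.6 }, { id: "2", score: 0.9 }],
);
const regressions = deltas.filter((d) => d.regression).map((d) => d.id);
// Test 1 regressed (0.8 → 0.6) even though test 2 improved.
```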

Use trace inspection to understand why specific cases failed:

```sh
agentv trace show <trace-id>
```

When reviewing failures, categorize them:

| Category | Description | Action |
| --- | --- | --- |
| True failure | The skill genuinely handled the case wrong | Improve the skill |
| False positive | Got a passing score but the answer was wrong | Tighten assertions |
| False negative | Correct answer but scored as failing | Fix the evaluation criteria |
| Systematic pattern | Multiple failures share the same root cause | Address the pattern, not individual cases |

Systematic patterns are the highest-value findings. A single skill improvement that fixes a pattern can resolve multiple test failures at once.
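
To make patterns visible, tag each failure with a root cause during review and group by that tag. A minimal sketch (the `cause` labels are whatever you assign during triage, not anything AgentV produces):

```typescript
// Group failed tests by an assigned root-cause label so that systematic
// patterns (many tests, one cause) stand out from one-off failures.
interface Failure { testId: string; cause: string; }

function groupByCause(failures: Failure[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const f of failures) {
    const ids = groups.get(f.cause) ?? [];
    ids.push(f.testId);
    groups.set(f.cause, ids);
  }
  return groups;
}

const groups = groupByCause([
  { testId: "2", cause: "false-positive complexity warning" },
  { testId: "5", cause: "false-positive complexity warning" },
  { testId: "8", cause: "false-positive complexity warning" },
  { testId: "3", cause: "missed null check" },
]);
// The three-test cluster is the systematic pattern to fix first.
```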

Apply targeted improvements based on your failure analysis:

  • Keep changes small and testable. One improvement per iteration makes it easy to attribute score changes.
  • Document what changed and why. A brief note in your commit message helps when reviewing the improvement history.
  • Address systematic patterns first. These give the best return on effort.
<!-- Example: skill change log in a commit message -->
```
fix(code-reviewer): handle edge case for single-line functions

The skill was flagging all single-line functions as "too terse" even when
they were appropriate (e.g., simple getters). Added context-aware length
assessment.

Failure pattern: tests 2, 5, 8 all failed with false-positive complexity warnings.
```

Loop back to Step 3 with the improved skill:

```sh
# Run the improved candidate
agentv eval evals.json --target candidate

# Compare against the previous baseline
agentv compare results-baseline.jsonl results-candidate.jsonl
```

Each iteration should show:

  • Previous regressions resolved
  • No new regressions introduced
  • Steady improvement in overall pass rate

When evals.json becomes limiting — you need workspace isolation, code graders, tool trajectory checks, or multi-turn conversations — graduate to EVAL.yaml:

```sh
agentv convert evals.json -o eval.yaml
```

The generated YAML preserves all your existing test cases and adds comments showing AgentV features you can use:

```yaml
# Converted from Agent Skills evals.json
tests:
  - id: "1"
    criteria: |-
      The function should handle division by zero.
    input:
      - role: user
        content: "Review this Python function for bugs:..."
    assertions:
      - name: assertion-1
        type: llm-grader
        prompt: "Identifies the division by zero risk"
        # Replace with type: contains for deterministic checks:
        # - type: contains
        #   value: "ZeroDivisionError"
```

After converting, you can:

  • Replace llm-grader assertions with faster deterministic evaluators (contains, regex, equals)
  • Add workspace configuration for file-system isolation
  • Use code-grader for custom scoring logic
  • Define tool-trajectory assertions to check tool usage patterns
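
A deterministic check is just a string or regex match against the agent's output, which is why it is fast and repeatable compared to an LLM grading round trip. A minimal sketch of the idea (the evaluator names mirror the list above, but the exact matching semantics are assumptions, not AgentV's documented behavior):

```typescript
// Minimal sketch of deterministic assertion types. AgentV's real
// evaluators may normalize or score differently; this only illustrates
// why they're cheaper and more repeatable than an llm-grader call.
type Assertion =
  | { type: "contains"; value: string }
  | { type: "regex"; pattern: string }
  | { type: "equals"; value: string };

function check(output: string, assertion: Assertion): boolean {
  switch (assertion.type) {
    case "contains": return output.includes(assertion.value);
    case "regex": return new RegExp(assertion.pattern).test(output);
    case "equals": return output === assertion.value;
  }
}

const output = "This function raises ZeroDivisionError when b is 0.";
check(output, { type: "contains", value: "ZeroDivisionError" }); // true
```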

See Skill Evals (evals.json) for the full field mapping and side-by-side comparison.

If you’ve been using the Agent Skills skill-creator workflow, AgentV reads your existing files directly — no rewrite needed.

| Skill-Creator | AgentV | Notes |
| --- | --- | --- |
| `evals.json` | `agentv eval evals.json` | Direct — no conversion needed |
| `claude -p "prompt"` | `agentv eval evals.json --target claude` | Same eval, richer engine |
| `grading.json` (read) | `grading.json` (write) | Same schema, AgentV produces it |
| `benchmark.json` (read) | `benchmark.json` (write) | Same schema, AgentV produces it |
| with-skill vs without-skill | `--target baseline` / `--target candidate` | Structured comparison |
| Graduate to richer evals | `agentv convert evals.json` → `EVAL.yaml` | Adds workspace, code graders, etc. |

Key takeaway: You do not need to rewrite your evals.json. AgentV reads it directly and adds a richer evaluation engine on top.

Skills placed in .claude/skills/ are auto-discovered and loaded into every agent session. This means your baseline run may unknowingly include the skill you’re trying to evaluate.

Mitigation strategies:

  1. Develop outside discovery paths — keep skills in drafts/ or wip/ during evaluation
  2. Use explicit target configurations — configure baseline and candidate targets with different skill sets
  3. Verify baseline purity — run a smoke test to confirm the baseline agent doesn’t reference your skill

When distributing skills, exclude evaluation files from the distributable package:

```
my-skill/
  SKILL.md        # ✅ distribute
  evals/          # ❌ exclude from distribution
    evals.json
    eval.yaml
    results/
```

Evals are development-time artifacts. End users don’t need them, and including them adds unnecessary weight to the package.

Progressive disclosure for skill authoring

Start simple and add complexity only when the evaluation results demand it:

  1. Start with evals.json — 5-10 test cases, natural-language assertions
  2. Add deterministic checks — when you find assertions that can be exact (contains, regex)
  3. Graduate to EVAL.yaml — when you need workspace isolation or code graders
  4. Add tool trajectory checks — when tool usage patterns matter
  5. Use rubrics — when you need weighted, structured scoring criteria

For users who want the full automated improvement cycle, the agentv-bench skill runs a 5-phase optimization loop:

  1. Analyze — examines the current skill and evaluation results
  2. Hypothesize — generates improvement hypotheses from failure patterns
  3. Implement — applies targeted skill modifications
  4. Evaluate — re-runs the evaluation suite
  5. Decide — keeps improvements that help, reverts those that don’t
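
The decide phase is what keeps the automated loop safe: changes that don't measurably help get rolled back. A schematic sketch of that keep-or-revert decision (the interfaces are invented for illustration, not agentv-bench's API):

```typescript
// Keep-or-revert decision: accept a change only if the re-run score
// improved; otherwise undo it. The Change interface is an assumption.
interface Change {
  apply: () => void;
  revert: () => void;
}

function decide(scoreBefore: number, scoreAfter: number, change: Change): boolean {
  if (scoreAfter > scoreBefore) {
    return true;      // keep the improvement
  }
  change.revert();    // roll back changes that don't help
  return false;
}
```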

The optimizer uses the same core loop described in this guide but automates the human steps. Start with the manual loop to build intuition, then graduate to the optimizer when you’re comfortable with the evaluation workflow.

Its bundled scripts map directly onto the workflow stages:

  • run-eval.ts and compare-runs.ts run and compare evaluations while still delegating to agentv
  • run-loop.ts repeats the evaluation loop without moving evaluator logic into the script layer
  • aggregate-benchmark.ts and generate-report.ts summarize AgentV artifacts into review-friendly output
  • improve-description.ts proposes follow-up description experiments once execution quality is stable

Code-grader execution, grading semantics, and artifact schemas still live in AgentV core. The scripts layer is orchestration glue over those existing primitives.