# Skill Improvement Workflow

## Introduction

AgentV supports a full evaluation-driven improvement loop for skills and agents. Instead of guessing whether a change makes things better, you run structured evaluations before and after, then compare.
This guide teaches the core manual loop. For automated iteration that runs the full cycle hands-free, see agentv-bench.
## The Core Loop

Every skill improvement follows the same cycle:

```text
┌─────────────────┐
│ Write Scenarios │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Run Baseline   │◄──────────────────┐
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│  Run Candidate  │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│     Compare     │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│ Review Failures │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│  Improve Skill  │────── Re-run ─────┘
└─────────────────┘
```

- Write test scenarios that capture what the skill should do
- Run a baseline evaluation without the skill (or with the previous version)
- Run a candidate evaluation with the new or updated skill
- Compare the two runs to see what improved and what regressed
- Review failures to understand why specific cases failed
- Improve the skill based on failure analysis
- Re-run and iterate until the candidate consistently beats the baseline
## Step 1: Write Test Scenarios

Start with `evals.json` for quick iteration. It’s the simplest format and works directly with AgentV — no conversion needed.

```json
{
  "skill_name": "code-reviewer",
  "evals": [
    {
      "id": 1,
      "prompt": "Review this Python function for bugs:\n\ndef divide(a, b):\n    return a / b",
      "expected_output": "The function should handle division by zero.",
      "assertions": [
        "Identifies the division by zero risk",
        "Suggests adding error handling"
      ]
    },
    {
      "id": 2,
      "prompt": "Review this function:\n\ndef greet(name):\n    return f'Hello, {name}!'",
      "expected_output": "The function is simple and correct.",
      "assertions": [
        "Does not flag false issues",
        "Acknowledges the function is straightforward"
      ]
    }
  ]
}
```

For assisted authoring, use the agentv-eval-builder skill — it knows the full eval file schema and can generate test cases from descriptions.
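Before running a new eval file, a quick structural check can catch duplicate ids or empty assertion lists early. A minimal TypeScript sketch — the interfaces and `validateEvalFile` are illustrative, not part of AgentV:

```typescript
// Hypothetical sanity check for an evals.json file before running it.
// The shapes mirror the example above; none of these names come from AgentV.
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  assertions: string[];
}

interface EvalFile {
  skill_name: string;
  evals: EvalCase[];
}

function validateEvalFile(file: EvalFile): string[] {
  const problems: string[] = [];
  const seen = new Set<number>();
  for (const c of file.evals) {
    if (seen.has(c.id)) problems.push(`duplicate id ${c.id}`);
    seen.add(c.id);
    if (!c.prompt.trim()) problems.push(`eval ${c.id}: empty prompt`);
    if (c.assertions.length === 0) problems.push(`eval ${c.id}: no assertions`);
  }
  return problems;
}

const spec: EvalFile = {
  skill_name: "code-reviewer",
  evals: [
    {
      id: 1,
      prompt: "Review this Python function for bugs: ...",
      expected_output: "The function should handle division by zero.",
      assertions: ["Identifies the division by zero risk"],
    },
  ],
};

console.log(validateEvalFile(spec)); // empty array when the file is well-formed
```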
## Step 2: Run Baseline Evaluation

Run the evaluation without the skill loaded to establish a baseline:

```sh
agentv eval evals.json --target baseline
```

This produces a results file (e.g., `results-baseline.jsonl`) showing how the agent performs on its own.
### Baseline isolation

Skills in `.claude/skills/` are auto-loaded by progressive disclosure. This means your baseline may accidentally include the skill you’re testing.

Workaround: Develop skills outside discovery paths during the evaluation cycle. Keep your skill-in-progress in a working directory (e.g., `drafts/`) and only move it to `.claude/skills/` when you’re satisfied with the evaluation results.
```text
# Skill lives outside the discovery path during development
drafts/
  my-skill/
    SKILL.md
```

```sh
# Baseline run won't pick it up
agentv eval evals.json --target baseline
```

## Step 3: Run Candidate Evaluation
Run the same evaluation with the skill loaded:

```sh
agentv eval evals.json --target candidate
```

Or if using agent mode (no API keys required):

```sh
# List available test cases
agentv prompt eval --list evals.json

# Get the input prompt for a test case
agentv prompt eval --input evals.json --test-id 1

# After running the agent, fetch expected output and evaluator criteria
agentv prompt eval --expected-output evals.json --test-id 1
```

Agent mode is useful when you want to evaluate skills with agents that don’t have a direct API integration — you orchestrate the run yourself and use AgentV’s accessor commands to read the eval spec.
If you’re using the agentv-bench skill bundle, the equivalent wrappers are:

```sh
cd plugins/agentv-dev/skills/agentv-bench
bun install
bun scripts/quick-validate.ts --scope wrappers
bun scripts/prompt-eval.ts --list evals.json
bun scripts/prompt-eval.ts --input evals.json --test-id 1
bun scripts/prompt-eval.ts --expected-output evals.json --test-id 1
```

## Step 4: Compare Results
Compare the baseline and candidate runs:

```sh
agentv compare results-baseline.jsonl results-candidate.jsonl
```

The comparison output shows:
- Per-test score deltas — which cases improved, regressed, or stayed the same
- Aggregate statistics — overall pass rate change, mean score shift
- Regressions — cases that were passing before but now fail (these need immediate attention)
Look for:
- ✅ Net positive delta — more cases improved than regressed
- ⚠️ Any regressions — even one regression deserves investigation
- 📊 Score distribution — are improvements concentrated or spread across cases?
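The comparison logic can be sketched in a few lines of TypeScript. The record shape and score range here are assumptions for illustration, not AgentV’s actual JSONL schema:

```typescript
// Illustrative sketch of the comparison step: given per-test scores from a
// baseline and a candidate run, report improvements, regressions, and ties.
interface TestResult {
  testId: string;
  score: number; // assume scores in [0, 1]
}

interface Comparison {
  improved: string[];
  regressed: string[];
  unchanged: string[];
  meanDelta: number;
}

function compareRuns(baseline: TestResult[], candidate: TestResult[]): Comparison {
  const base = new Map(baseline.map((r) => [r.testId, r.score] as [string, number]));
  const out: Comparison = { improved: [], regressed: [], unchanged: [], meanDelta: 0 };
  let total = 0;
  for (const c of candidate) {
    const b = base.get(c.testId);
    if (b === undefined) continue; // test not present in the baseline run
    const delta = c.score - b;
    total += delta;
    if (delta > 0) out.improved.push(c.testId);
    else if (delta < 0) out.regressed.push(c.testId);
    else out.unchanged.push(c.testId);
  }
  out.meanDelta = candidate.length ? total / candidate.length : 0;
  return out;
}

const baselineRun = [
  { testId: "1", score: 0.4 },
  { testId: "2", score: 0.9 },
];
const candidateRun = [
  { testId: "1", score: 0.8 },
  { testId: "2", score: 0.6 },
];
const report = compareRuns(baselineRun, candidateRun);
console.log(report.improved, report.regressed); // test 1 improved, test 2 regressed
```

Even with a net-positive mean delta, the regression on test 2 is the finding to chase first.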
## Step 5: Review Failures

Use trace inspection to understand why specific cases failed:

```sh
agentv trace show <trace-id>
```

When reviewing failures, categorize them:
| Category | Description | Action |
|---|---|---|
| True failure | The skill genuinely handled the case wrong | Improve the skill |
| False positive | Got a passing score but the answer was wrong | Tighten assertions |
| False negative | Correct answer but scored as failing | Fix the evaluation criteria |
| Systematic pattern | Multiple failures share the same root cause | Address the pattern, not individual cases |
Systematic patterns are the highest-value findings. A single skill improvement that fixes a pattern can resolve multiple test failures at once.
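One lightweight way to make systematic patterns visible is to tag each failure with a category from the table above and tally the counts. A sketch — the category names follow the table, everything else is illustrative:

```typescript
// Tag each failing test with a triage category, then count per category so
// shared root causes stand out. Names and data here are made up for the example.
type FailureCategory =
  | "true-failure"
  | "false-positive"
  | "false-negative"
  | "systematic-pattern";

interface TriagedFailure {
  testId: string;
  category: FailureCategory;
  note: string;
}

function tallyByCategory(failures: TriagedFailure[]): Map<FailureCategory, number> {
  const counts = new Map<FailureCategory, number>();
  for (const f of failures) {
    counts.set(f.category, (counts.get(f.category) ?? 0) + 1);
  }
  return counts;
}

const triaged: TriagedFailure[] = [
  { testId: "2", category: "systematic-pattern", note: "false complexity warning" },
  { testId: "5", category: "systematic-pattern", note: "false complexity warning" },
  { testId: "7", category: "false-negative", note: "grader too strict" },
];

// Three failures, but only two distinct root causes worth fixing.
console.log(tallyByCategory(triaged));
```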
## Step 6: Improve the Skill

Apply targeted improvements based on your failure analysis:
- Keep changes small and testable. One improvement per iteration makes it easy to attribute score changes.
- Document what changed and why. A brief note in your commit message helps when reviewing the improvement history.
- Address systematic patterns first. These give the best return on effort.
<!-- Example: skill change log in a commit message -->

```text
fix(code-reviewer): handle edge case for single-line functions

The skill was flagging all single-line functions as "too terse" even when
they were appropriate (e.g., simple getters). Added context-aware length
assessment.

Failure pattern: tests 2, 5, 8 all failed with false-positive complexity warnings.
```

## Step 7: Re-run and Iterate
Loop back to Step 3 with the improved skill:

```sh
# Run the improved candidate
agentv eval evals.json --target candidate

# Compare against the previous baseline
agentv compare results-baseline.jsonl results-candidate.jsonl
```

Each iteration should show:
- Previous regressions resolved
- No new regressions introduced
- Steady improvement in overall pass rate
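These criteria amount to a simple acceptance gate for each iteration. A sketch, assuming per-test pass/fail summaries (a simplification of real scored results):

```typescript
// Illustrative iteration gate: accept a new candidate only if it keeps every
// previously passing test passing and grows the pass count. The pass/fail
// Set shape is an assumed simplification, not AgentV's result format.
interface RunSummary {
  passed: Set<string>;
}

function acceptIteration(previous: RunSummary, current: RunSummary): boolean {
  // A new regression is any test that passed before but fails now.
  for (const id of previous.passed) {
    if (!current.passed.has(id)) return false;
  }
  // Require strict improvement in pass count, not just parity.
  return current.passed.size > previous.passed.size;
}

const previousRun = { passed: new Set(["1", "2"]) };
const currentRun = { passed: new Set(["1", "2", "3"]) };
console.log(acceptIteration(previousRun, currentRun)); // true: no regressions, net gain
```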
## Graduating to EVAL.yaml

When evals.json becomes limiting — you need workspace isolation, code graders, tool trajectory checks, or multi-turn conversations — graduate to EVAL.yaml:

```sh
agentv convert evals.json -o eval.yaml
```

The generated YAML preserves all your existing test cases and adds comments showing AgentV features you can use:

```yaml
# Converted from Agent Skills evals.json
tests:
  - id: "1"
    criteria: |-
      The function should handle division by zero.
    input:
      - role: user
        content: "Review this Python function for bugs:..."
    assertions:
      - name: assertion-1
        type: llm-grader
        prompt: "Identifies the division by zero risk"
        # Replace with type: contains for deterministic checks:
        # - type: contains
        #   value: "ZeroDivisionError"
```

After converting, you can:
- Replace `llm-grader` assertions with faster deterministic evaluators (`contains`, `regex`, `equals`)
- Add `workspace` configuration for file-system isolation
- Use `code-grader` for custom scoring logic
- Define `tool-trajectory` assertions to check tool usage patterns
See Skill Evals (evals.json) for the full field mapping and side-by-side comparison.
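For intuition, the deterministic assertion types behave roughly like this TypeScript sketch. The discriminated-union shape is an assumption for illustration, not the actual EVAL.yaml schema:

```typescript
// Sketch of deterministic evaluators: each assertion type maps to a plain
// string check, which is why they run faster than an llm-grader round trip.
type Assertion =
  | { type: "contains"; value: string }
  | { type: "regex"; value: string }
  | { type: "equals"; value: string };

function evaluate(output: string, assertion: Assertion): boolean {
  switch (assertion.type) {
    case "contains":
      return output.includes(assertion.value);
    case "regex":
      return new RegExp(assertion.value).test(output);
    case "equals":
      return output === assertion.value;
  }
}

const output = "Raises ZeroDivisionError when b is 0; add error handling.";
console.log(evaluate(output, { type: "contains", value: "ZeroDivisionError" })); // true
console.log(evaluate(output, { type: "regex", value: "b is \\d" })); // true
```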
## Migration from Skill-Creator

If you’ve been using the Agent Skills skill-creator workflow, AgentV reads your existing files directly — no rewrite needed.
| Skill-Creator | AgentV | Notes |
|---|---|---|
| `evals.json` | `agentv eval evals.json` | Direct — no conversion needed |
| `claude -p "prompt"` | `agentv eval evals.json --target claude` | Same eval, richer engine |
| `grading.json` (read) | `grading.json` (write) | Same schema, AgentV produces it |
| `benchmark.json` (read) | `benchmark.json` (write) | Same schema, AgentV produces it |
| with-skill vs without-skill | `--target baseline` / `--target candidate` | Structured comparison |
| Graduate to richer evals | `agentv convert evals.json` → EVAL.yaml | Adds workspace, code graders, etc. |
Key takeaway: You do not need to rewrite your `evals.json`. AgentV reads it directly and adds a richer evaluation engine on top.
## Baseline Comparison Best Practices

### Discovery-path contamination

Skills placed in `.claude/skills/` are auto-discovered and loaded into every agent session. This means your baseline run may unknowingly include the skill you’re trying to evaluate.
Mitigation strategies:

- Develop outside discovery paths — keep skills in `drafts/` or `wip/` during evaluation
- Use explicit target configurations — configure baseline and candidate targets with different skill sets
- Verify baseline purity — run a smoke test to confirm the baseline agent doesn’t reference your skill
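A baseline-purity smoke test can be as simple as scanning the baseline transcript for the skill’s name. An illustrative sketch — the transcript text and the check itself are made up; a real check would read the baseline results file:

```typescript
// Toy purity check: flag a baseline run whose transcript mentions the skill
// under test, which suggests the skill leaked in via auto-discovery.
function baselineMentionsSkill(transcript: string, skillName: string): boolean {
  return transcript.toLowerCase().includes(skillName.toLowerCase());
}

const transcript = "Reviewed the function; no external skill instructions loaded.";
console.log(baselineMentionsSkill(transcript, "code-reviewer")); // false: baseline looks clean
```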
### Packaging guidance

When distributing skills, exclude evaluation files from the distributable package:

```text
my-skill/
  SKILL.md        # ✅ distribute
  evals/          # ❌ exclude from distribution
    evals.json
    eval.yaml
    results/
```

Evals are development-time artifacts. End users don’t need them, and including them adds unnecessary weight to the package.
### Progressive disclosure for skill authoring

Start simple and add complexity only when the evaluation results demand it:

- Start with `evals.json` — 5-10 test cases, natural-language assertions
- Add deterministic checks — when you find assertions that can be exact (`contains`, `regex`)
- Graduate to EVAL.yaml — when you need workspace isolation or code graders
- Add tool trajectory checks — when tool usage patterns matter
- Use rubrics — when you need weighted, structured scoring criteria
## Automated Iteration

For users who want the full automated improvement cycle, the agentv-bench skill runs a 5-phase optimization loop:
- Analyze — examines the current skill and evaluation results
- Hypothesize — generates improvement hypotheses from failure patterns
- Implement — applies targeted skill modifications
- Evaluate — re-runs the evaluation suite
- Decide — keeps improvements that help, reverts those that don’t
The optimizer uses the same core loop described in this guide but automates the human steps. Start with the manual loop to build intuition, then graduate to the optimizer when you’re comfortable with the evaluation workflow.
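For intuition, the decide phase reduces to a keep-or-revert comparison. A toy sketch with placeholder bodies for the other phases — none of this is agentv-bench’s actual code:

```typescript
// Toy sketch of the optimizer's phases. Every body is a placeholder;
// agentv-bench's real phases run evaluations and edit skill files.
interface Hypothesis {
  description: string;
}

// Analyze: look for low-scoring cases in the latest results.
function analyze(scores: number[]): string {
  return scores.some((s) => s < 0.5) ? "low-scoring cases present" : "all passing";
}

// Hypothesize: turn a finding into candidate improvements.
function hypothesize(finding: string): Hypothesis[] {
  return finding === "all passing"
    ? []
    : [{ description: "tighten skill guidance for flagged cases" }];
}

// Decide: keep a change only if it raised the mean score, otherwise revert.
function decide(beforeMean: number, afterMean: number): "keep" | "revert" {
  return afterMean > beforeMean ? "keep" : "revert";
}

const finding = analyze([0.3, 0.9]);
console.log(hypothesize(finding).length); // one hypothesis from the failure pattern
console.log(decide(0.55, 0.7)); // "keep"
```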
Its bundled scripts map directly onto the workflow stages:
- `run-eval.ts` and `compare-runs.ts` run and compare evaluations while still delegating to `agentv`
- `run-loop.ts` repeats the evaluation loop without moving evaluator logic into the script layer
- `aggregate-benchmark.ts` and `generate-report.ts` summarize AgentV artifacts into review-friendly output
- `improve-description.ts` proposes follow-up description experiments once execution quality is stable
Code-grader execution, grading semantics, and artifact schemas still live in AgentV core. The scripts layer is orchestration glue over those existing primitives.