# Skill Improvement Workflow

## Introduction

AgentV supports a full evaluation-driven improvement loop for skills and agents. Instead of guessing whether a change makes things better, you run structured evaluations before and after, then compare.
This guide teaches the core manual loop. For automated iteration that runs the full cycle hands-free, see agentv-bench.
## The Core Loop

Every skill improvement follows the same cycle:

```text
┌─────────────────┐
│ Write Scenarios │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Run Baseline   │◄──────────────────┐
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│  Run Candidate  │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│     Compare     │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│ Review Failures │                   │
└────────┬────────┘                   │
         ▼                            │
┌─────────────────┐                   │
│  Improve Skill  │────── Re-run ─────┘
└─────────────────┘
```

- Write test scenarios that capture what the skill should do
- Run a baseline evaluation without the skill (or with the previous version)
- Run a candidate evaluation with the new or updated skill
- Compare the two runs to see what improved and what regressed
- Review failures to understand why specific cases failed
- Improve the skill based on failure analysis
- Re-run and iterate until the candidate consistently beats the baseline
## Step 1: Write Test Scenarios

Start with `evals.json` for quick iteration. It’s the simplest format and works directly with AgentV — no conversion needed.

```json
{
  "skill_name": "code-reviewer",
  "evals": [
    {
      "id": 1,
      "prompt": "Review this Python function for bugs:\n\ndef divide(a, b):\n    return a / b",
      "expected_output": "The function should handle division by zero.",
      "assertions": [
        "Identifies the division by zero risk",
        "Suggests adding error handling"
      ]
    },
    {
      "id": 2,
      "prompt": "Review this function:\n\ndef greet(name):\n    return f'Hello, {name}!'",
      "expected_output": "The function is simple and correct.",
      "assertions": [
        "Does not flag false issues",
        "Acknowledges the function is straightforward"
      ]
    }
  ]
}
```

For assisted authoring, use the agentv-eval-builder skill — it knows the full eval file schema and can generate test cases from descriptions.
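Before running a new eval file, a quick structural check can catch duplicate ids or empty assertion lists early. A minimal TypeScript sketch — the interfaces and `validateEvalFile` are illustrative, not part of AgentV:

```typescript
// Hypothetical sanity check for an evals.json file before running it.
// The shapes mirror the example above; none of these names come from AgentV.
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  assertions: string[];
}

interface EvalFile {
  skill_name: string;
  evals: EvalCase[];
}

function validateEvalFile(file: EvalFile): string[] {
  const problems: string[] = [];
  const seen = new Set<number>();
  for (const c of file.evals) {
    if (seen.has(c.id)) problems.push(`duplicate id ${c.id}`);
    seen.add(c.id);
    if (!c.prompt.trim()) problems.push(`eval ${c.id}: empty prompt`);
    if (c.assertions.length === 0) problems.push(`eval ${c.id}: no assertions`);
  }
  return problems;
}

const spec: EvalFile = {
  skill_name: "code-reviewer",
  evals: [
    {
      id: 1,
      prompt: "Review this Python function for bugs: ...",
      expected_output: "The function should handle division by zero.",
      assertions: ["Identifies the division by zero risk"],
    },
  ],
};

console.log(validateEvalFile(spec)); // empty array when the file is well-formed
```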
## Step 2: Run Baseline Evaluation

Run the evaluation without the skill loaded to establish a baseline:

```sh
agentv eval evals.json --target baseline
```

This produces a results file (e.g., `results-baseline.jsonl`) showing how the agent performs on its own.
### Baseline isolation

Skills in `.claude/skills/` are auto-loaded by progressive disclosure. This means your baseline may accidentally include the skill you’re testing.

Workaround: Develop skills outside discovery paths during the evaluation cycle. Keep your skill-in-progress in a working directory (e.g., `drafts/`) and only move it to `.claude/skills/` when you’re satisfied with the evaluation results.
```text
# Skill lives outside the discovery path during development
drafts/
  my-skill/
    SKILL.md
```

```sh
# Baseline run won't pick it up
agentv eval evals.json --target baseline
```

## Step 3: Run Candidate Evaluation
Run the same evaluation with the skill loaded:

```sh
agentv eval evals.json --target candidate
```

Or if using agent mode (no API keys required):

```sh
# List available test cases
agentv prompt eval --list evals.json

# Get the input prompt for a test case
agentv prompt eval --input evals.json --test-id 1

# After running the agent, fetch expected output and evaluator criteria
agentv prompt eval --expected-output evals.json --test-id 1
```

Agent mode is useful when you want to evaluate skills with agents that don’t have a direct API integration — you orchestrate the run yourself and use AgentV’s accessor commands to read the eval spec.
If you’re using the agentv-bench skill bundle, the equivalent wrappers are:

```sh
cd plugins/agentv-dev/skills/agentv-bench
bun install
bun scripts/quick-validate.ts --scope wrappers
bun scripts/prompt-eval.ts --list evals.json
bun scripts/prompt-eval.ts --input evals.json --test-id 1
bun scripts/prompt-eval.ts --expected-output evals.json --test-id 1
```

## Step 4: Compare Results
Compare the baseline and candidate runs:

```sh
agentv compare results-baseline.jsonl results-candidate.jsonl
```

The comparison output shows:
- Per-test score deltas — which cases improved, regressed, or stayed the same
- Aggregate statistics — overall pass rate change, mean score shift
- Regressions — cases that were passing before but now fail (these need immediate attention)
Look for:
- ✅ Net positive delta — more cases improved than regressed
- ⚠️ Any regressions — even one regression deserves investigation
- 📊 Score distribution — are improvements concentrated or spread across cases?
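The comparison logic can be sketched in a few lines of TypeScript. The record shape and score range here are assumptions for illustration, not AgentV’s actual JSONL schema:

```typescript
// Illustrative sketch of the comparison step: given per-test scores from a
// baseline and a candidate run, report improvements, regressions, and ties.
interface TestResult {
  testId: string;
  score: number; // assume scores in [0, 1]
}

interface Comparison {
  improved: string[];
  regressed: string[];
  unchanged: string[];
  meanDelta: number;
}

function compareRuns(baseline: TestResult[], candidate: TestResult[]): Comparison {
  const base = new Map(baseline.map((r) => [r.testId, r.score] as [string, number]));
  const out: Comparison = { improved: [], regressed: [], unchanged: [], meanDelta: 0 };
  let total = 0;
  for (const c of candidate) {
    const b = base.get(c.testId);
    if (b === undefined) continue; // test not present in the baseline run
    const delta = c.score - b;
    total += delta;
    if (delta > 0) out.improved.push(c.testId);
    else if (delta < 0) out.regressed.push(c.testId);
    else out.unchanged.push(c.testId);
  }
  out.meanDelta = candidate.length ? total / candidate.length : 0;
  return out;
}

const baselineRun = [
  { testId: "1", score: 0.4 },
  { testId: "2", score: 0.9 },
];
const candidateRun = [
  { testId: "1", score: 0.8 },
  { testId: "2", score: 0.6 },
];
const report = compareRuns(baselineRun, candidateRun);
console.log(report.improved, report.regressed); // test 1 improved, test 2 regressed
```

Even with a net-positive mean delta, the regression on test 2 is the finding to chase first.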
## Step 5: Review Failures

Use trace inspection to understand why specific cases failed:

```sh
agentv trace show <trace-id>
```

When reviewing failures, categorize them:
| Category | Description | Action |
|---|---|---|
| True failure | The skill genuinely handled the case wrong | Improve the skill |
| False positive | Got a passing score but the answer was wrong | Tighten assertions |
| False negative | Correct answer but scored as failing | Fix the evaluation criteria |
| Systematic pattern | Multiple failures share the same root cause | Address the pattern, not individual cases |
Systematic patterns are the highest-value findings. A single skill improvement that fixes a pattern can resolve multiple test failures at once.
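One lightweight way to make systematic patterns visible is to tag each failure with a category from the table above and tally the counts. A sketch — the category names follow the table, everything else is illustrative:

```typescript
// Tag each failing test with a triage category, then count per category so
// shared root causes stand out. Names and data here are made up for the example.
type FailureCategory =
  | "true-failure"
  | "false-positive"
  | "false-negative"
  | "systematic-pattern";

interface TriagedFailure {
  testId: string;
  category: FailureCategory;
  note: string;
}

function tallyByCategory(failures: TriagedFailure[]): Map<FailureCategory, number> {
  const counts = new Map<FailureCategory, number>();
  for (const f of failures) {
    counts.set(f.category, (counts.get(f.category) ?? 0) + 1);
  }
  return counts;
}

const triaged: TriagedFailure[] = [
  { testId: "2", category: "systematic-pattern", note: "false complexity warning" },
  { testId: "5", category: "systematic-pattern", note: "false complexity warning" },
  { testId: "7", category: "false-negative", note: "grader too strict" },
];

// Three failures, but only two distinct root causes worth fixing.
console.log(tallyByCategory(triaged));
```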
## Step 6: Improve the Skill

Apply targeted improvements based on your failure analysis:
- Keep changes small and testable. One improvement per iteration makes it easy to attribute score changes.
- Document what changed and why. A brief note in your commit message helps when reviewing the improvement history.
- Address systematic patterns first. These give the best return on effort.
<!-- Example: skill change log in a commit message -->

```text
fix(code-reviewer): handle edge case for single-line functions

The skill was flagging all single-line functions as "too terse" even when
they were appropriate (e.g., simple getters). Added context-aware length
assessment.

Failure pattern: tests 2, 5, 8 all failed with false-positive complexity warnings.
```

## Step 7: Re-run and Iterate
Loop back to Step 3 with the improved skill:

```sh
# Run the improved candidate
agentv eval evals.json --target candidate

# Compare against the previous baseline
agentv compare results-baseline.jsonl results-candidate.jsonl
```

Each iteration should show:
- Previous regressions resolved
- No new regressions introduced
- Steady improvement in overall pass rate
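These criteria amount to a simple acceptance gate for each iteration. A sketch, assuming per-test pass/fail summaries (a simplification of real scored results):

```typescript
// Illustrative iteration gate: accept a new candidate only if it keeps every
// previously passing test passing and grows the pass count. The pass/fail
// Set shape is an assumed simplification, not AgentV's result format.
interface RunSummary {
  passed: Set<string>;
}

function acceptIteration(previous: RunSummary, current: RunSummary): boolean {
  // A new regression is any test that passed before but fails now.
  for (const id of previous.passed) {
    if (!current.passed.has(id)) return false;
  }
  // Require strict improvement in pass count, not just parity.
  return current.passed.size > previous.passed.size;
}

const previousRun = { passed: new Set(["1", "2"]) };
const currentRun = { passed: new Set(["1", "2", "3"]) };
console.log(acceptIteration(previousRun, currentRun)); // true: no regressions, net gain
```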
## Graduating to EVAL.yaml

When evals.json becomes limiting — you need workspace isolation, code graders, tool trajectory checks, or multi-turn conversations — graduate to EVAL.yaml:

```sh
agentv convert evals.json -o eval.yaml
```

The generated YAML preserves all your existing test cases and adds comments showing AgentV features you can use:

```yaml
# Converted from Agent Skills evals.json
tests:
  - id: "1"
    criteria: |-
      The function should handle division by zero.
    input:
      - role: user
        content: "Review this Python function for bugs:..."
    assertions:
      - name: assertion-1
        type: llm-grader
        prompt: "Identifies the division by zero risk"
        # Replace with type: contains for deterministic checks:
        # - type: contains
        #   value: "ZeroDivisionError"
```

After converting, you can:
- Replace `llm-grader` assertions with faster deterministic evaluators (`contains`, `regex`, `equals`)
- Add `workspace` configuration for file-system isolation
- Use `code-grader` for custom scoring logic
- Define `tool-trajectory` assertions to check tool usage patterns
See Skill Evals (evals.json) for the full field mapping and side-by-side comparison.
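For intuition, the deterministic assertion types behave roughly like this TypeScript sketch. The discriminated-union shape is an assumption for illustration, not the actual EVAL.yaml schema:

```typescript
// Sketch of deterministic evaluators: each assertion type maps to a plain
// string check, which is why they run faster than an llm-grader round trip.
type Assertion =
  | { type: "contains"; value: string }
  | { type: "regex"; value: string }
  | { type: "equals"; value: string };

function evaluate(output: string, assertion: Assertion): boolean {
  switch (assertion.type) {
    case "contains":
      return output.includes(assertion.value);
    case "regex":
      return new RegExp(assertion.value).test(output);
    case "equals":
      return output === assertion.value;
  }
}

const output = "Raises ZeroDivisionError when b is 0; add error handling.";
console.log(evaluate(output, { type: "contains", value: "ZeroDivisionError" })); // true
console.log(evaluate(output, { type: "regex", value: "b is \\d" })); // true
```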
## Migration from Skill-Creator

If you’ve been using the Agent Skills skill-creator workflow, AgentV reads your existing files directly — no rewrite needed.
| Skill-Creator | AgentV | Notes |
|---|---|---|
| `evals.json` | `agentv eval evals.json` | Direct — no conversion needed |
| `claude -p "prompt"` | `agentv eval evals.json --target claude` | Same eval, richer engine |
| `grading.json` (read) | `grading.json` (write) | Same schema, AgentV produces it |
| `benchmark.json` (read) | `benchmark.json` (write) | Same schema, AgentV produces it |
| with-skill vs without-skill | `--target baseline` / `--target candidate` | Structured comparison |
| Graduate to richer evals | `agentv convert evals.json` → EVAL.yaml | Adds workspace, code graders, etc. |
Key takeaway: You do not need to rewrite your `evals.json`. AgentV reads it directly and adds a richer evaluation engine on top.
## Baseline Comparison Best Practices

### Discovery-path contamination

Skills placed in `.claude/skills/` are auto-discovered and loaded into every agent session. This means your baseline run may unknowingly include the skill you’re trying to evaluate.
Mitigation strategies:

- Develop outside discovery paths — keep skills in `drafts/` or `wip/` during evaluation
- Use explicit target configurations — configure baseline and candidate targets with different skill sets
- Verify baseline purity — run a smoke test to confirm the baseline agent doesn’t reference your skill
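A baseline-purity smoke test can be as simple as scanning the baseline transcript for the skill’s name. An illustrative sketch — the transcript text and the check itself are made up; a real check would read the baseline results file:

```typescript
// Toy purity check: flag a baseline run whose transcript mentions the skill
// under test, which suggests the skill leaked in via auto-discovery.
function baselineMentionsSkill(transcript: string, skillName: string): boolean {
  return transcript.toLowerCase().includes(skillName.toLowerCase());
}

const transcript = "Reviewed the function; no external skill instructions loaded.";
console.log(baselineMentionsSkill(transcript, "code-reviewer")); // false: baseline looks clean
```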
### Packaging guidance

When distributing skills, exclude evaluation files from the distributable package:

```text
my-skill/
  SKILL.md        # ✅ distribute
  evals/          # ❌ exclude from distribution
    evals.json
    eval.yaml
    results/
```

Evals are development-time artifacts. End users don’t need them, and including them adds unnecessary weight to the package.
### Progressive disclosure for skill authoring

Start simple and add complexity only when the evaluation results demand it:

- Start with `evals.json` — 5-10 test cases, natural-language assertions
- Add deterministic checks — when you find assertions that can be exact (`contains`, `regex`)
- Graduate to EVAL.yaml — when you need workspace isolation or code graders
- Add tool trajectory checks — when tool usage patterns matter
- Use rubrics — when you need weighted, structured scoring criteria
## Automated Iteration

For users who want the full automated improvement cycle, the agentv-bench skill runs a 5-phase optimization loop:
- Analyze — examines the current skill and evaluation results
- Hypothesize — generates improvement hypotheses from failure patterns
- Implement — applies targeted skill modifications
- Evaluate — re-runs the evaluation suite
- Decide — keeps improvements that help, reverts those that don’t
The optimizer uses the same core loop described in this guide but automates the human steps. Start with the manual loop to build intuition, then graduate to the optimizer when you’re comfortable with the evaluation workflow.
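For intuition, the decide phase reduces to a keep-or-revert comparison. A toy sketch with placeholder bodies for the other phases — none of this is agentv-bench’s actual code:

```typescript
// Toy sketch of the optimizer's phases. Every body is a placeholder;
// agentv-bench's real phases run evaluations and edit skill files.
interface Hypothesis {
  description: string;
}

// Analyze: look for low-scoring cases in the latest results.
function analyze(scores: number[]): string {
  return scores.some((s) => s < 0.5) ? "low-scoring cases present" : "all passing";
}

// Hypothesize: turn a finding into candidate improvements.
function hypothesize(finding: string): Hypothesis[] {
  return finding === "all passing"
    ? []
    : [{ description: "tighten skill guidance for flagged cases" }];
}

// Decide: keep a change only if it raised the mean score, otherwise revert.
function decide(beforeMean: number, afterMean: number): "keep" | "revert" {
  return afterMean > beforeMean ? "keep" : "revert";
}

const finding = analyze([0.3, 0.9]);
console.log(hypothesize(finding).length); // one hypothesis from the failure pattern
console.log(decide(0.55, 0.7)); // "keep"
```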
Its bundled scripts map directly onto the workflow stages:
- `run-eval.ts` and `compare-runs.ts` run and compare evaluations while still delegating to `agentv`
- `run-loop.ts` repeats the evaluation loop without moving evaluator logic into the script layer
- `aggregate-benchmark.ts` and `generate-report.ts` summarize AgentV artifacts into review-friendly output
- `improve-description.ts` proposes follow-up description experiments once execution quality is stable
Code-grader execution, grading semantics, and artifact schemas still live in AgentV core. The scripts layer is orchestration glue over those existing primitives.