# Human Review Checkpoint

Human review sits between automated scoring and the next iteration. Automated evaluators catch regressions and enforce thresholds, but a human reviewer spots score-behavior mismatches, qualitative regressions, and cases where a grader is too strict or too lenient.

Review after every eval run where you plan to iterate on the skill or agent. The workflow:

  1. Run evals — `agentv eval EVAL.yaml` or `agentv eval evals.json`
  2. Inspect results — open the HTML report or scan the results JSONL
  3. Write feedback — create `feedback.json` alongside the results
  4. Iterate — use the feedback to guide prompt changes, evaluator tuning, or test case additions
  5. Re-run — verify improvements in the next eval run

Skip the review step for routine CI gate runs where you only need pass/fail.

| Signal | Example |
| --- | --- |
| Score-behavior mismatch | A test scores 0.9 but the output is clearly wrong — the grader missed an error |
| False positive | A `contains` check passes on a coincidental substring match |
| False negative | An LLM grader penalizes a correct answer that uses different phrasing |
| Qualitative regression | Scores stay the same but tone, formatting, or helpfulness degrades |
| Evaluator miscalibration | A code grader is too strict on whitespace; a rubric is too lenient on accuracy |
| Flaky results | The same test produces wildly different scores across runs |
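
Some of these signals can be triaged mechanically before the manual pass. A minimal sketch, assuming each results line carries the `testId` and `score` fields used in the jq examples in this doc; the thresholds are illustrative, not part of any tool:

```python
import json

def triage(results_lines, fail_below=0.8, spot_check_above=0.95):
    """Split results into likely failures and high-scoring cases worth a
    manual spot check (a grader may have missed an error)."""
    failing, spot_check = [], []
    for line in results_lines:
        rec = json.loads(line)
        if rec["score"] < fail_below:
            failing.append(rec["testId"])
        elif rec["score"] >= spot_check_above:
            spot_check.append(rec["testId"])
    return failing, spot_check

# Inline sample instead of reading results/output.jsonl:
sample = [
    '{"testId": "test-feature-alpha", "score": 0.72}',
    '{"testId": "test-retrieval-basic", "score": 0.98}',
    '{"testId": "test-edge-case-empty", "score": 0.85}',
]
failing, spot_check = triage(sample)
# failing → ["test-feature-alpha"]; spot_check → ["test-retrieval-basic"]
```

Mid-range scores land in neither bucket; those are exactly the borderline cases the manual review is for.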

For workspace evaluations (EVAL.yaml), use the trace viewer:

```sh
# View traces from a specific run
agentv trace show results/2026-03-14T10-32-00_claude/traces.jsonl

# View the HTML report (if generated via #562)
open results/2026-03-14T10-32-00_claude/report.html
```

For simple skill evaluations (evals.json), scan the results JSONL:

```sh
# Show failing tests
cat results/output.jsonl | jq 'select(.score < 0.8)'

# Show all scores
cat results/output.jsonl | jq '{id: .testId, score: .score, verdict: .verdict}'
```
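
Flaky results only show up across runs, so filtering a single file won't surface them. A hypothetical sketch that compares scores per test across several runs (same field assumptions as above; the spread threshold is arbitrary):

```python
def flaky_tests(runs, max_spread=0.3):
    """runs: one {testId: score} dict per eval run.
    Flags tests whose score range across runs exceeds max_spread."""
    scores_by_test = {}
    for run in runs:
        for test_id, score in run.items():
            scores_by_test.setdefault(test_id, []).append(score)
    return sorted(
        test_id
        for test_id, scores in scores_by_test.items()
        if len(scores) > 1 and max(scores) - min(scores) > max_spread
    )

runs = [
    {"test-edge-case-empty": 1.0, "test-feature-alpha": 0.72},
    {"test-edge-case-empty": 0.2, "test-feature-alpha": 0.70},
    {"test-edge-case-empty": 1.0, "test-feature-alpha": 0.75},
]
# flaky_tests(runs) → ["test-edge-case-empty"]
```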

Create a feedback.json file in the results directory, alongside results.jsonl or output.jsonl:

```
results/
  2026-03-14T10-32-00_claude/
    results.jsonl    # automated eval results
    traces.jsonl     # execution traces
    feedback.json    # ← your review annotations
```

The feedback.json file is a structured annotation of a single eval run. It records the reviewer’s qualitative assessment alongside the automated scores.

```json
{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "overall_notes": "Retrieval tests need more diverse queries. Code grader for format-check is too strict on trailing newlines.",
  "per_case": [
    {
      "test_id": "test-feature-alpha",
      "verdict": "acceptable",
      "notes": "Score is borderline (0.72) but behavior is correct — the grader penalized for different phrasing."
    },
    {
      "test_id": "test-retrieval-basic",
      "verdict": "needs_improvement",
      "notes": "Missing coverage of multi-document queries.",
      "evaluator_overrides": {
        "code-grader:format-check": "Too strict — penalized valid output with trailing newline",
        "llm-grader:quality": "Score 0.6 seems fair, answer was incomplete"
      },
      "workspace_notes": "Workspace had stale cached files from previous run — may have affected retrieval results."
    },
    {
      "test_id": "test-edge-case-empty",
      "verdict": "flaky",
      "notes": "Passed on 2 of 3 runs. Likely non-determinism in the agent's tool selection."
    }
  ]
}
```
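
Writing this boilerplate by hand gets tedious across runs; one option is to pre-populate the file from the automated results and fill in only the verdicts and notes. A sketch, assuming results lines carry a `testId` field:

```python
import json

def feedback_skeleton(run_id, reviewer, results_lines):
    """Build a feedback.json structure with one empty per_case entry
    per result, for the reviewer to fill in."""
    return {
        "run_id": run_id,
        "reviewer": reviewer,
        "timestamp": "",       # set when the review is complete
        "overall_notes": "",
        "per_case": [
            {"test_id": json.loads(line)["testId"], "verdict": "", "notes": ""}
            for line in results_lines
            if line.strip()
        ],
    }

skeleton = feedback_skeleton(
    "2026-03-14T10-32-00_claude",
    "engineer-name",
    ['{"testId": "test-feature-alpha", "score": 0.72}'],
)
```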
Top-level fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `run_id` | string | yes | Identifies the eval run (matches the results directory name or run identifier) |
| `reviewer` | string | yes | Who performed the review |
| `timestamp` | string (ISO 8601) | yes | When the review was completed |
| `overall_notes` | string | no | High-level observations about the run |
| `per_case` | array | no | Per-test-case annotations |

Each entry in `per_case`:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `test_id` | string | yes | Matches the test id from the eval file |
| `verdict` | enum | yes | One of: `acceptable`, `needs_improvement`, `incorrect`, `flaky` |
| `notes` | string | no | Free-form reviewer notes |
| `evaluator_overrides` | object | no | Keyed by evaluator name — reviewer annotations on specific evaluator results |
| `workspace_notes` | string | no | Notes about workspace state (relevant for workspace evaluations) |

Verdict meanings:

| Verdict | Meaning |
| --- | --- |
| `acceptable` | Automated score and actual behavior are both satisfactory |
| `needs_improvement` | The output or coverage needs work — not a bug, but not good enough |
| `incorrect` | The output is wrong, regardless of what the automated score says |
| `flaky` | Results are inconsistent across runs — investigate non-determinism |
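
The required fields and the verdict enum are easy to get wrong when editing by hand. A small validator could check a feedback document before it is committed; this is a sketch of the rules in the tables above, not an official schema:

```python
VERDICTS = {"acceptable", "needs_improvement", "incorrect", "flaky"}

def validate_feedback(doc):
    """Return a list of problems; an empty list means the document
    satisfies the required-field and verdict rules."""
    errors = []
    for field in ("run_id", "reviewer", "timestamp"):
        if not doc.get(field):
            errors.append(f"missing required field: {field}")
    for i, case in enumerate(doc.get("per_case", [])):
        if not case.get("test_id"):
            errors.append(f"per_case[{i}]: missing test_id")
        if case.get("verdict") not in VERDICTS:
            errors.append(f"per_case[{i}]: invalid verdict {case.get('verdict')!r}")
    return errors

ok = validate_feedback({
    "run_id": "2026-03-14T10-32-00_claude",
    "reviewer": "engineer-name",
    "timestamp": "2026-03-14T12:00:00Z",
    "per_case": [{"test_id": "test-feature-alpha", "verdict": "acceptable"}],
})
bad = validate_feedback({"run_id": "r", "per_case": [{"verdict": "fine"}]})
# ok → []; bad → four errors (reviewer, timestamp, test_id, verdict)
```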

## Evaluator overrides (workspace evaluations)

For workspace evaluations with multiple evaluators (code graders, LLM graders, tool trajectory checks), the evaluator_overrides field lets the reviewer annotate specific evaluator results:

```json
{
  "test_id": "test-refactor-api",
  "verdict": "needs_improvement",
  "evaluator_overrides": {
    "code-grader:test-pass": "Tests pass but the refactored code has a subtle race condition the tests don't cover",
    "llm-grader:quality": "Score 0.9 is too high — the agent left dead code behind",
    "tool-trajectory:efficiency": "Used 12 tool calls where 5 would suffice, but the result is correct"
  },
  "workspace_notes": "Agent cloned the repo correctly but didn't clean up temp files."
}
```

Keys use the format `evaluator-type:evaluator-name` to match the evaluators defined in `assert` blocks.

Keep feedback files alongside results to build a history of review decisions:

```
results/
  2026-03-12T09-00-00_claude/
    results.jsonl
    feedback.json    # first iteration review
  2026-03-14T10-32-00_claude/
    results.jsonl
    feedback.json    # second iteration review
  2026-03-15T16-00-00_claude/
    results.jsonl
    feedback.json    # third iteration review
```

This creates a traceable record of what changed between iterations and why. When debugging a regression, check previous feedback.json files to see if the issue was noted before.
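
Checking prior reviews for a given test can be sketched as a pure function over the parsed feedback documents (load each `feedback.json` with `json.load` first; `verdict_history` is a hypothetical helper, not part of `agentv`):

```python
def verdict_history(test_id, feedback_docs):
    """feedback_docs: parsed feedback.json documents, oldest first.
    Returns (run_id, verdict, notes) tuples for the given test."""
    history = []
    for doc in feedback_docs:
        for case in doc.get("per_case", []):
            if case.get("test_id") == test_id:
                history.append(
                    (doc["run_id"], case["verdict"], case.get("notes", ""))
                )
    return history

docs = [
    {"run_id": "2026-03-12T09-00-00_claude",
     "per_case": [{"test_id": "test-edge-case-empty", "verdict": "flaky",
                   "notes": "Passed on 2 of 3 runs."}]},
    {"run_id": "2026-03-14T10-32-00_claude",
     "per_case": [{"test_id": "test-edge-case-empty", "verdict": "acceptable"}]},
]
history = verdict_history("test-edge-case-empty", docs)
# verdicts across runs → ["flaky", "acceptable"]
```

A test that was already flagged `flaky` in an earlier review is a strong hint that a new regression is noise rather than signal.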

The review checkpoint fits into the broader eval iteration loop:

  1. Define tests (`EVAL.yaml` / `evals.json`)
  2. Run automated evals
  3. Review results ← you are here
  4. Write `feedback.json`
  5. Tune prompts / evaluators / test cases
  6. Re-run evals
  7. Compare with previous run (`agentv compare`)
  8. Review again (if iterating)

Use `agentv compare` to quantify changes between runs, then review the diff to confirm that score improvements reflect genuine behavioral improvements.