Running Evaluations

Terminal window
agentv eval evals/my-eval.yaml

Results are written to .agentv/results/eval_<timestamp>.jsonl. Each line is a JSON object with one result per test case.

Each scores[] entry includes per-grader timing:

{
  "scores": [
    {
      "name": "format_structure",
      "type": "llm-grader",
      "score": 0.9,
      "verdict": "pass",
      "hits": ["clear structure"],
      "misses": [],
      "duration_ms": 9103,
      "started_at": "2026-03-09T00:05:10.123Z",
      "ended_at": "2026-03-09T00:05:19.226Z",
      "token_usage": { "input": 2711, "output": 2535 }
    }
  ]
}

The duration_ms, started_at, and ended_at fields are present on every grader result (including code-grader), enabling per-grader bottleneck analysis.
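As a sketch of that bottleneck analysis, a short script can rank graders by total time across a results file. The field names follow the record shown above; the example result lines are illustrative:

```python
import json
from collections import defaultdict

def slowest_graders(jsonl_lines):
    """Sum duration_ms per grader name across result lines, slowest first."""
    totals = defaultdict(int)
    for line in jsonl_lines:
        for entry in json.loads(line).get("scores", []):
            totals[entry["name"]] += entry.get("duration_ms", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Two illustrative result lines shaped like the record above
lines = [
    '{"scores": [{"name": "format_structure", "duration_ms": 9103}]}',
    '{"scores": [{"name": "format_structure", "duration_ms": 4200},'
    ' {"name": "rouge_score", "duration_ms": 120}]}',
]
ranking = slowest_graders(lines)
```

In practice you would read the lines from `.agentv/results/eval_<timestamp>.jsonl` instead of an in-memory list.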

Run against a different target than the one specified in the eval file:

Terminal window
agentv eval --target azure-base evals/**/*.yaml

Run a single test by ID:

Terminal window
agentv eval --test-id case-123 evals/my-eval.yaml

Test the harness flow with mock responses (does not call real providers):

Terminal window
agentv eval --dry-run evals/my-eval.yaml
Write results to a custom path instead of the default .agentv/results location:

Terminal window
agentv eval evals/my-eval.yaml --out results/baseline.jsonl

Export execution traces (tool calls, timing, spans) to files for debugging and analysis:

Terminal window
# Human-readable JSONL trace (one record per test case)
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl
# OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json
# Both formats simultaneously
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl --otel-file traces/eval.otlp.json

The --trace-file format writes JSONL records containing:

  • test_id - The test identifier
  • target / score - Target and evaluation score
  • duration_ms - Total execution duration
  • spans - Array of tool invocations with timing
  • token_usage / cost_usd - Resource consumption

The --otel-file format writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.
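A minimal sketch of consuming those --trace-file records, using the field names listed above (the example records are illustrative, and nothing beyond the documented fields is assumed about span contents):

```python
import json

def summarize_trace(jsonl_lines):
    """Collect (test_id, duration_ms, span count) per record plus total cost."""
    summary = []
    total_cost = 0.0
    for line in jsonl_lines:
        rec = json.loads(line)
        total_cost += rec.get("cost_usd", 0.0)
        summary.append((rec["test_id"], rec["duration_ms"], len(rec.get("spans", []))))
    return summary, total_cost

# Illustrative trace records
lines = [
    '{"test_id": "case-1", "duration_ms": 8000, "spans": [{"name": "read_file"}], "cost_usd": 0.02}',
    '{"test_id": "case-2", "duration_ms": 5000, "spans": [], "cost_usd": 0.01}',
]
summary, cost = summarize_trace(lines)
```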

Stream traces directly to an observability backend during evaluation using --export-otel:

Terminal window
# Use a backend preset (braintrust, langfuse, confident)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust
# Include message content and tool I/O in spans (disabled by default for privacy)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content
# Group messages into turn spans for multi-turn evaluations
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-group-turns

Set up your environment:

Terminal window
export BRAINTRUST_API_KEY=sk-...
export BRAINTRUST_PROJECT=my-project # associates traces with a Braintrust project

Run an eval with traces sent to Braintrust:

Terminal window
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content

The following environment variables control project association (at least one is required):

Variable               Format                  Example
BRAINTRUST_PROJECT     Project name            my-evals
BRAINTRUST_PROJECT_ID  Project UUID            proj_abc123
BRAINTRUST_PARENT      Raw x-bt-parent header  project_name:my-evals

Each eval test case produces a trace with:

  • Root span (agentv.eval) — test ID, target, score, duration
  • LLM call spans (chat <model>) — model name, token usage (input/output/cached)
  • Tool call spans (execute_tool <name>) — tool name, arguments, results (with --otel-capture-content)
  • Turn spans (agentv.turn.N) — groups messages by conversation turn (with --otel-group-turns)
  • Evaluator events — per-grader scores attached to the root span
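Because the OTLP file follows the standard OTLP JSON layout (resourceSpans → scopeSpans → spans), span names like those above can be listed with a short sketch. The model and tool names in the example are hypothetical:

```python
import json

def span_names(otlp_json: str):
    """Collect span names from an OTLP JSON trace export."""
    names = []
    for resource in json.loads(otlp_json).get("resourceSpans", []):
        for scope in resource.get("scopeSpans", []):
            for span in scope.get("spans", []):
                names.append(span["name"])
    return names

# Illustrative OTLP document with the span names described above
otlp = json.dumps({
    "resourceSpans": [{
        "scopeSpans": [{
            "spans": [
                {"name": "agentv.eval"},
                {"name": "chat gpt-4o"},            # hypothetical model name
                {"name": "execute_tool read_file"},  # hypothetical tool name
            ]
        }]
    }]
})
names = span_names(otlp)
```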
To stream traces to Langfuse instead, set the Langfuse keys and use the langfuse preset:

Terminal window
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# Optional: export LANGFUSE_HOST=https://cloud.langfuse.com
agentv eval evals/my-eval.yaml --export-otel --otel-backend langfuse --otel-capture-content

For backends not covered by presets, configure via environment variables:

Terminal window
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/traces
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
agentv eval evals/my-eval.yaml --export-otel

Use a workspace mode and finish policies instead of multiple conflicting boolean flags:

Terminal window
# Mode: pooled | temp | static
agentv eval evals/my-eval.yaml --workspace-mode pooled
# Static mode path
agentv eval evals/my-eval.yaml --workspace-mode static --workspace-path /path/to/workspace
# Pooled reset policy override: standard | full (CLI override)
agentv eval evals/my-eval.yaml --workspace-clean full
# Finish policy overrides: keep | cleanup (CLI)
agentv eval evals/my-eval.yaml --retain-on-success cleanup --retain-on-failure keep

Equivalent eval YAML:

workspace:
  mode: pooled   # pooled | temp | static
  path: null     # workspace path for mode=static; auto-materialised when empty/missing
  hooks:
    enabled: true   # set false to skip all hooks
    after_each:
      reset: fast   # none | fast | strict

Notes:

  • Pooled mode is the default for shared workspaces with repos when mode is not specified.
  • mode: static (or --workspace-mode static) uses path / --workspace-path. When the path is empty or missing, the workspace is auto-materialised (template copied + repos cloned). Populated directories are reused as-is.
  • Static mode is incompatible with isolation: per_test.
  • hooks.enabled: false skips all lifecycle hooks (setup, teardown, reset).
  • Pool slots are managed separately (agentv workspace list|clean).

Re-run only the tests that had infrastructure/execution errors from a previous output:

Terminal window
agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl

This reads the previous JSONL, filters for executionStatus === 'execution_error', and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
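The filter-and-merge behaviour can be sketched as follows. This is a simplification: the `test_id` field name and the exact result-record shape are assumptions, while the `executionStatus` check matches the documented filter:

```python
import json

def split_previous_results(jsonl_lines):
    """Separate error results (to re-run) from results to preserve."""
    retry_ids, keep = [], []
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("executionStatus") == "execution_error":
            retry_ids.append(rec["test_id"])   # field name assumed
        else:
            keep.append(rec)                   # merged into the new output
    return retry_ids, keep

# Illustrative previous-run records
lines = [
    '{"test_id": "case-1", "executionStatus": "completed", "score": 0.8}',
    '{"test_id": "case-2", "executionStatus": "execution_error"}',
]
retry_ids, keep = split_previous_results(lines)
```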

Control whether the eval run halts on execution errors using execution.fail_on_error in the eval YAML:

execution:
  fail_on_error: false   # never halt on errors (default)
  # fail_on_error: true  # halt on first execution error

Value  Behavior
true   Halt immediately on first execution error
false  Continue despite errors (default)

When halted, remaining tests are recorded with failureReasonCode: 'error_threshold_exceeded'. With concurrency > 1, a few additional tests may complete before halting takes effect.

Check eval files for schema errors without executing:

Terminal window
agentv validate evals/my-eval.yaml

Run a code-grader assertion in isolation without executing a full eval suite:

Terminal window
agentv eval assert <name> --agent-output <text> --agent-input <text>

The command discovers the assertion script by walking up directories looking for .agentv/graders/<name>.{ts,js,mts,mjs}, then passes the input via stdin and prints the result JSON to stdout.
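The discovery walk can be sketched as follows (a simplification of the documented lookup; the throwaway directory tree at the bottom is purely illustrative):

```python
import tempfile
from pathlib import Path

EXTENSIONS = (".ts", ".js", ".mts", ".mjs")

def find_grader(start: Path, name: str):
    """Walk up from start looking for .agentv/graders/<name>.<ext>."""
    for directory in (start, *start.parents):
        for ext in EXTENSIONS:
            candidate = directory / ".agentv" / "graders" / f"{name}{ext}"
            if candidate.is_file():
                return candidate
    return None

# Demonstrate against a throwaway directory tree
root = Path(tempfile.mkdtemp())
script = root / ".agentv" / "graders" / "rouge-score.ts"
script.parent.mkdir(parents=True)
script.write_text("// grader stub")
found = find_grader(root / "evals" / "nested", "rouge-score")
```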

Terminal window
# Run an assertion with inline arguments
agentv eval assert rouge-score \
--agent-output "The fox jumps over the lazy dog" \
--agent-input "Summarise the article"
# Or pass a JSON payload file
agentv eval assert rouge-score --file result.json

The --file option reads a JSON file with { "output": "...", "input": "..." } fields.

Exit codes: 0 if score >= 0.5 (pass), 1 if score < 0.5 (fail).

This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits assert instructions for code graders so external grading agents can execute them directly.

Run evaluations without API keys by letting an external agent (e.g., Claude Code, Copilot CLI) orchestrate the eval pipeline.

List the available test cases for an eval file:

Terminal window
agentv eval prompt eval --list evals/my-eval.yaml

Returns JSON listing the available test_ids for the eval file.

Fetch the rendered input for a single test:

Terminal window
agentv eval prompt eval --input evals/my-eval.yaml --test-id case-123

Returns JSON with:

  • input — an array of {role, content} messages. File references use absolute paths ({type: "file", path: "/abs/path"}) that the agent can read directly from the filesystem.
  • guideline_paths — files containing additional instructions to prepend to the system message.
  • criteria — grading criteria for the orchestrator’s reference (do not pass to the candidate).
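An orchestrating agent might flatten such a message into plain text like this. This is a sketch: beyond the documented {type: "file", path} reference, the exact content-part shape is an assumption, and the fake filesystem stands in for real reads:

```python
def flatten_content(content, read_file):
    """Inline file references into plain text (content shape assumed)."""
    if isinstance(content, str):
        return content
    parts = []
    for part in content:
        if part.get("type") == "file":
            parts.append(read_file(part["path"]))   # agent reads the file itself
        else:
            parts.append(part.get("text", ""))
    return "\n".join(parts)

# Illustrative message with a file reference
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Summarise this file:"},
        {"type": "file", "path": "/abs/path/article.txt"},
    ]},
]
fake_fs = {"/abs/path/article.txt": "The article body."}
flat = [(m["role"], flatten_content(m["content"], fake_fs.__getitem__))
        for m in messages]
```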
Fetch the reference data for grading a test:

Terminal window
agentv eval prompt eval --expected-output evals/my-eval.yaml --test-id case-123

Returns JSON with the data an external grader needs:

  • expected_output — reference assistant messages
  • reference_answer — flattened reference text when available
  • criteria — high-level success criteria
  • assertions — evaluator configs for the test

Output a human-readable summary of the grading criteria for a specific test, with type-prefixed assertion tags:

Terminal window
agentv eval prompt eval --grading-brief evals/my-eval.yaml --test-id case-123

Example output:

Input: "Summarise the following article in one sentence."
Expected: "The quick brown fox jumps over the lazy dog near the river bank."
Criteria:
- [code-grader] rouge-score: Measures n-gram recall and F1
- [llm-grader] Summary captures key points
- [skill-trigger] should_trigger: true for summariser

This is useful for agents orchestrating evals to understand what criteria a test is evaluated against before running it.

Scenario                                             Command
Have API keys, want end-to-end automation            agentv eval
Run a single assertion in isolation                  agentv eval assert <name>
No API keys, external agent can orchestrate the run  agentv eval prompt eval --list/--input/--expected-output
Inspect grading criteria before running              agentv eval prompt eval --grading-brief

Declare the minimum AgentV version needed by your eval project in .agentv/config.yaml:

required_version: ">=2.12.0"

The value is a semver range using standard npm syntax (e.g., >=2.12.0, ^2.12.0, ~2.12, >=2.12.0 <3.0.0).
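The version check behaves like an ordinary semver comparison; here is a minimal sketch for the simplest >= form (the real CLI accepts full npm semver ranges, including ^, ~, and compound ranges, which this sketch does not handle):

```python
def parse(version: str) -> tuple:
    """'2.12.0' -> (2, 12, 0), so tuples compare like versions."""
    return tuple(int(part) for part in version.split("."))

def satisfies_min(version: str, version_range: str) -> bool:
    """Handle only '>=X.Y.Z'; a simplification of npm semver matching."""
    assert version_range.startswith(">="), "sketch handles only >= ranges"
    return parse(version) >= parse(version_range[2:].strip())

ok = satisfies_min("2.13.0", ">=2.12.0")
too_old = satisfies_min("2.11.9", ">=2.12.0")
```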

Condition                Interactive (TTY)            Non-interactive (CI)
Version satisfies range  Runs silently                Runs silently
Version below range      Warns + prompts to continue  Warns to stderr, continues
--strict flag + mismatch Warns + exits 1              Warns + exits 1
No required_version set  Runs silently                Runs silently
Malformed semver range   Error + exits 1              Error + exits 1

Use --strict in CI pipelines to enforce version requirements:

Terminal window
agentv eval --strict evals/my-eval.yaml

Set default execution options so you don’t have to pass them on every CLI invocation. Both .agentv/config.yaml and agentv.config.ts are supported.

execution:
  verbose: true
  trace_file: .agentv/results/trace-{timestamp}.jsonl
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json

Field            CLI equivalent     Type     Default  Description
verbose          --verbose          boolean  false    Enable verbose logging
trace_file       --trace-file       string   none     Write human-readable trace JSONL
keep_workspaces  --keep-workspaces  boolean  false    Always keep temp workspaces after eval
otel_file        --otel-file        string   none     Write OTLP JSON trace to file
import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    traceFile: '.agentv/results/trace-{timestamp}.jsonl',
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});

The {timestamp} placeholder is replaced with an ISO-like timestamp (e.g., 2026-03-05T14-30-00-000Z) at execution time.

Precedence: CLI flags > .agentv/config.yaml > agentv.config.ts > built-in defaults.
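A sketch of how those layers and the {timestamp} placeholder combine (illustrative only, not AgentV's actual implementation; option names mirror the table above):

```python
from datetime import datetime, timezone

def resolve(cli: dict, yaml_cfg: dict, ts_cfg: dict, defaults: dict) -> dict:
    """Merge option layers; later layers win, unset (None) values fall through."""
    merged = dict(defaults)
    for layer in (ts_cfg, yaml_cfg, cli):  # lowest to highest precedence
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

def expand_timestamp(path: str, now: datetime) -> str:
    """Replace {timestamp} with a filename-safe ISO-like stamp."""
    stamp = now.strftime("%Y-%m-%dT%H-%M-%S-") + f"{now.microsecond // 1000:03d}Z"
    return path.replace("{timestamp}", stamp)

opts = resolve(
    cli={"verbose": None},  # flag not passed on this invocation
    yaml_cfg={"verbose": True},
    ts_cfg={"trace_file": ".agentv/results/trace-{timestamp}.jsonl"},
    defaults={"verbose": False, "trace_file": None},
)
path = expand_timestamp(opts["trace_file"],
                        datetime(2026, 3, 5, 14, 30, 0, tzinfo=timezone.utc))
```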

Override the default ~/.agentv directory for all global runtime data (workspaces, git cache, subagents, trace state, version check cache):

Terminal window
# Linux/macOS
export AGENTV_HOME=/data/agentv
# Windows (PowerShell)
$env:AGENTV_HOME = "D:\agentv"
# Windows (CMD)
set AGENTV_HOME=D:\agentv

When set, AgentV logs Using AGENTV_HOME: <path> on startup to confirm the override is active.

Run agentv eval --help for the full list of options including workers, timeouts, output formats, and trace dumping.