Running Evaluations

Terminal window
agentv eval evals/my-eval.yaml

Results are written to .agentv/results/eval_<timestamp>.jsonl. Each line is a JSON object with one result per test case.

Each scores[] entry includes per-grader timing:

{
  "scores": [
    {
      "name": "format_structure",
      "type": "llm-grader",
      "score": 0.9,
      "verdict": "pass",
      "hits": ["clear structure"],
      "misses": [],
      "duration_ms": 9103,
      "started_at": "2026-03-09T00:05:10.123Z",
      "ended_at": "2026-03-09T00:05:19.226Z",
      "token_usage": { "input": 2711, "output": 2535 }
    }
  ]
}

The duration_ms, started_at, and ended_at fields are present on every grader result (including code-grader), enabling per-grader bottleneck analysis.
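As a sketch of that bottleneck analysis, a short script can rank graders by total time across a results file. The field names follow the record shown above; the example result lines are illustrative:

```python
import json
from collections import defaultdict

def slowest_graders(jsonl_lines):
    """Sum duration_ms per grader name across result lines, slowest first."""
    totals = defaultdict(int)
    for line in jsonl_lines:
        for entry in json.loads(line).get("scores", []):
            totals[entry["name"]] += entry.get("duration_ms", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Two illustrative result lines shaped like the record above
lines = [
    '{"scores": [{"name": "format_structure", "duration_ms": 9103}]}',
    '{"scores": [{"name": "format_structure", "duration_ms": 4200},'
    ' {"name": "rouge_score", "duration_ms": 120}]}',
]
ranking = slowest_graders(lines)
```

In practice you would read the lines from `.agentv/results/eval_<timestamp>.jsonl` instead of an in-memory list.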

Run against a different target than the one specified in the eval file:

Terminal window
agentv eval --target azure-base evals/**/*.yaml

Run a single test by ID:

Terminal window
agentv eval --test-id case-123 evals/my-eval.yaml

Test the harness flow with mock responses (does not call real providers):

Terminal window
agentv eval --dry-run evals/my-eval.yaml
Write results to a custom path instead of the default .agentv/results location:

Terminal window
agentv eval evals/my-eval.yaml --out results/baseline.jsonl

Export execution traces (tool calls, timing, spans) to files for debugging and analysis:

Terminal window
# Human-readable JSONL trace (one record per test case)
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl
# OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json
# Both formats simultaneously
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl --otel-file traces/eval.otlp.json

The --trace-file format writes JSONL records containing:

  • test_id - The test identifier
  • target / score - Target and evaluation score
  • duration_ms - Total execution duration
  • spans - Array of tool invocations with timing
  • token_usage / cost_usd - Resource consumption

The --otel-file format writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.
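A minimal sketch of consuming those --trace-file records, using the field names listed above (the example records are illustrative, and nothing beyond the documented fields is assumed about span contents):

```python
import json

def summarize_trace(jsonl_lines):
    """Collect (test_id, duration_ms, span count) per record plus total cost."""
    summary = []
    total_cost = 0.0
    for line in jsonl_lines:
        rec = json.loads(line)
        total_cost += rec.get("cost_usd", 0.0)
        summary.append((rec["test_id"], rec["duration_ms"], len(rec.get("spans", []))))
    return summary, total_cost

# Illustrative trace records
lines = [
    '{"test_id": "case-1", "duration_ms": 8000, "spans": [{"name": "read_file"}], "cost_usd": 0.02}',
    '{"test_id": "case-2", "duration_ms": 5000, "spans": [], "cost_usd": 0.01}',
]
summary, cost = summarize_trace(lines)
```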

Stream traces directly to an observability backend during evaluation using --export-otel:

Terminal window
# Use a backend preset (braintrust, langfuse, confident)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust
# Include message content and tool I/O in spans (disabled by default for privacy)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content
# Group messages into turn spans for multi-turn evaluations
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-group-turns

Set up your environment:

Terminal window
export BRAINTRUST_API_KEY=sk-...
export BRAINTRUST_PROJECT=my-project # associates traces with a Braintrust project

Run an eval with traces sent to Braintrust:

Terminal window
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content

The following environment variables control project association (at least one is required):

Variable               Format                  Example
BRAINTRUST_PROJECT     Project name            my-evals
BRAINTRUST_PROJECT_ID  Project UUID            proj_abc123
BRAINTRUST_PARENT      Raw x-bt-parent header  project_name:my-evals

Each eval test case produces a trace with:

  • Root span (agentv.eval) — test ID, target, score, duration
  • LLM call spans (chat <model>) — model name, token usage (input/output/cached)
  • Tool call spans (execute_tool <name>) — tool name, arguments, results (with --otel-capture-content)
  • Turn spans (agentv.turn.N) — groups messages by conversation turn (with --otel-group-turns)
  • Evaluator events — per-grader scores attached to the root span
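Because the OTLP file follows the standard OTLP JSON layout (resourceSpans → scopeSpans → spans), span names like those above can be listed with a short sketch. The model and tool names in the example are hypothetical:

```python
import json

def span_names(otlp_json: str):
    """Collect span names from an OTLP JSON trace export."""
    names = []
    for resource in json.loads(otlp_json).get("resourceSpans", []):
        for scope in resource.get("scopeSpans", []):
            for span in scope.get("spans", []):
                names.append(span["name"])
    return names

# Illustrative OTLP document with the span names described above
otlp = json.dumps({
    "resourceSpans": [{
        "scopeSpans": [{
            "spans": [
                {"name": "agentv.eval"},
                {"name": "chat gpt-4o"},            # hypothetical model name
                {"name": "execute_tool read_file"},  # hypothetical tool name
            ]
        }]
    }]
})
names = span_names(otlp)
```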
To stream traces to Langfuse instead, set the Langfuse keys and use the langfuse preset:

Terminal window
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# Optional: export LANGFUSE_HOST=https://cloud.langfuse.com
agentv eval evals/my-eval.yaml --export-otel --otel-backend langfuse --otel-capture-content

For backends not covered by presets, configure via environment variables:

Terminal window
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/traces
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
agentv eval evals/my-eval.yaml --export-otel

Use a workspace mode and finish policies instead of multiple conflicting boolean flags:

Terminal window
# Mode: pooled | temp | static
agentv eval evals/my-eval.yaml --workspace-mode pooled
# Static mode path
agentv eval evals/my-eval.yaml --workspace-mode static --workspace-path /path/to/workspace
# Pooled reset policy override: standard | full (CLI override)
agentv eval evals/my-eval.yaml --workspace-clean full
# Finish policy overrides: keep | cleanup (CLI)
agentv eval evals/my-eval.yaml --retain-on-success cleanup --retain-on-failure keep

Equivalent eval YAML:

workspace:
  mode: pooled   # pooled | temp | static
  path: null     # workspace path for mode=static; auto-materialised when empty/missing
  hooks:
    enabled: true   # set false to skip all hooks
    after_each:
      reset: fast   # none | fast | strict

Notes:

  • Pooled mode is the default for shared workspaces with repos when mode is not specified.
  • mode: static (or --workspace-mode static) uses path / --workspace-path. When the path is empty or missing, the workspace is auto-materialised (template copied + repos cloned). Populated directories are reused as-is.
  • Static mode is incompatible with isolation: per_test.
  • hooks.enabled: false skips all lifecycle hooks (setup, teardown, reset).
  • Pool slots are managed separately (agentv workspace list|clean).

Re-run only the tests that had infrastructure/execution errors from a previous output:

Terminal window
agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl

This reads the previous JSONL, filters for executionStatus === 'execution_error', and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
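The filter-and-merge behaviour can be sketched as follows. This is a simplification: the `test_id` field name and the exact result-record shape are assumptions, while the `executionStatus` check matches the documented filter:

```python
import json

def split_previous_results(jsonl_lines):
    """Separate error results (to re-run) from results to preserve."""
    retry_ids, keep = [], []
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("executionStatus") == "execution_error":
            retry_ids.append(rec["test_id"])   # field name assumed
        else:
            keep.append(rec)                   # merged into the new output
    return retry_ids, keep

# Illustrative previous-run records
lines = [
    '{"test_id": "case-1", "executionStatus": "completed", "score": 0.8}',
    '{"test_id": "case-2", "executionStatus": "execution_error"}',
]
retry_ids, keep = split_previous_results(lines)
```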

Control whether the eval run halts on execution errors using execution.fail_on_error in the eval YAML:

execution:
  fail_on_error: false   # never halt on errors (default)
  # fail_on_error: true  # halt on first execution error

Value  Behavior
true   Halt immediately on first execution error
false  Continue despite errors (default)

When halted, remaining tests are recorded with failureReasonCode: 'error_threshold_exceeded'. With concurrency > 1, a few additional tests may complete before halting takes effect.

Check eval files for schema errors without executing:

Terminal window
agentv validate evals/my-eval.yaml

Run a code-grader assertion in isolation without executing a full eval suite:

Terminal window
agentv eval assert <name> --agent-output <text> --agent-input <text>

The command discovers the assertion script by walking up directories looking for .agentv/graders/<name>.{ts,js,mts,mjs}, then passes the input via stdin and prints the result JSON to stdout.
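The discovery walk can be sketched as follows (a simplification of the documented lookup; the throwaway directory tree at the bottom is purely illustrative):

```python
import tempfile
from pathlib import Path

EXTENSIONS = (".ts", ".js", ".mts", ".mjs")

def find_grader(start: Path, name: str):
    """Walk up from start looking for .agentv/graders/<name>.<ext>."""
    for directory in (start, *start.parents):
        for ext in EXTENSIONS:
            candidate = directory / ".agentv" / "graders" / f"{name}{ext}"
            if candidate.is_file():
                return candidate
    return None

# Demonstrate against a throwaway directory tree
root = Path(tempfile.mkdtemp())
script = root / ".agentv" / "graders" / "rouge-score.ts"
script.parent.mkdir(parents=True)
script.write_text("// grader stub")
found = find_grader(root / "evals" / "nested", "rouge-score")
```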

Terminal window
# Run an assertion with inline arguments
agentv eval assert rouge-score \
--agent-output "The fox jumps over the lazy dog" \
--agent-input "Summarise the article"
# Or pass a JSON payload file
agentv eval assert rouge-score --file result.json

The --file option reads a JSON file with { "output": "...", "input": "..." } fields.

Exit codes: 0 if score >= 0.5 (pass), 1 if score < 0.5 (fail).

This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits assert instructions for code graders so external grading agents can execute them directly.

Run evaluations without API keys by letting an external agent (e.g., Claude Code, Copilot CLI) orchestrate the eval pipeline.

List the available test cases for an eval file:

Terminal window
agentv eval prompt eval --list evals/my-eval.yaml

Returns JSON listing the available test_ids for the eval file.

Fetch the rendered input for a single test:

Terminal window
agentv eval prompt eval --input evals/my-eval.yaml --test-id case-123

Returns JSON with:

  • input — an array of {role, content} messages. File references use absolute paths ({type: "file", path: "/abs/path"}) that the agent can read directly from the filesystem.
  • guideline_paths — files containing additional instructions to prepend to the system message.
  • criteria — grading criteria for the orchestrator’s reference (do not pass to the candidate).
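An orchestrating agent might flatten such a message into plain text like this. This is a sketch: beyond the documented {type: "file", path} reference, the exact content-part shape is an assumption, and the fake filesystem stands in for real reads:

```python
def flatten_content(content, read_file):
    """Inline file references into plain text (content shape assumed)."""
    if isinstance(content, str):
        return content
    parts = []
    for part in content:
        if part.get("type") == "file":
            parts.append(read_file(part["path"]))   # agent reads the file itself
        else:
            parts.append(part.get("text", ""))
    return "\n".join(parts)

# Illustrative message with a file reference
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Summarise this file:"},
        {"type": "file", "path": "/abs/path/article.txt"},
    ]},
]
fake_fs = {"/abs/path/article.txt": "The article body."}
flat = [(m["role"], flatten_content(m["content"], fake_fs.__getitem__))
        for m in messages]
```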
Fetch the reference data for grading a test:

Terminal window
agentv eval prompt eval --expected-output evals/my-eval.yaml --test-id case-123

Returns JSON with the data an external grader needs:

  • expected_output — reference assistant messages
  • reference_answer — flattened reference text when available
  • criteria — high-level success criteria
  • assertions — evaluator configs for the test

Output a human-readable summary of the grading criteria for a specific test, with type-prefixed assertion tags:

Terminal window
agentv eval prompt eval --grading-brief evals/my-eval.yaml --test-id case-123

Example output:

Input: "Summarise the following article in one sentence."
Expected: "The quick brown fox jumps over the lazy dog near the river bank."
Criteria:
- [code-grader] rouge-score: Measures n-gram recall and F1
- [llm-grader] Summary captures key points
- [skill-trigger] should_trigger: true for summariser

This is useful for agents orchestrating evals to understand what criteria a test is evaluated against before running it.

Scenario                                             Command
Have API keys, want end-to-end automation            agentv eval
Run a single assertion in isolation                  agentv eval assert <name>
No API keys, external agent can orchestrate the run  agentv eval prompt eval --list/--input/--expected-output
Inspect grading criteria before running              agentv eval prompt eval --grading-brief

Declare the minimum AgentV version needed by your eval project in .agentv/config.yaml:

required_version: ">=2.12.0"

The value is a semver range using standard npm syntax (e.g., >=2.12.0, ^2.12.0, ~2.12, >=2.12.0 <3.0.0).
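The version check behaves like an ordinary semver comparison; here is a minimal sketch for the simplest >= form (the real CLI accepts full npm semver ranges, including ^, ~, and compound ranges, which this sketch does not handle):

```python
def parse(version: str) -> tuple:
    """'2.12.0' -> (2, 12, 0), so tuples compare like versions."""
    return tuple(int(part) for part in version.split("."))

def satisfies_min(version: str, version_range: str) -> bool:
    """Handle only '>=X.Y.Z'; a simplification of npm semver matching."""
    assert version_range.startswith(">="), "sketch handles only >= ranges"
    return parse(version) >= parse(version_range[2:].strip())

ok = satisfies_min("2.13.0", ">=2.12.0")
too_old = satisfies_min("2.11.9", ">=2.12.0")
```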

Condition                Interactive (TTY)            Non-interactive (CI)
Version satisfies range  Runs silently                Runs silently
Version below range      Warns + prompts to continue  Warns to stderr, continues
--strict flag + mismatch Warns + exits 1              Warns + exits 1
No required_version set  Runs silently                Runs silently
Malformed semver range   Error + exits 1              Error + exits 1

Use --strict in CI pipelines to enforce version requirements:

Terminal window
agentv eval --strict evals/my-eval.yaml

Set default execution options so you don’t have to pass them on every CLI invocation. Both .agentv/config.yaml and agentv.config.ts are supported.

execution:
  verbose: true
  trace_file: .agentv/results/trace-{timestamp}.jsonl
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json

Field            CLI equivalent     Type     Default  Description
verbose          --verbose          boolean  false    Enable verbose logging
trace_file       --trace-file       string   none     Write human-readable trace JSONL
keep_workspaces  --keep-workspaces  boolean  false    Always keep temp workspaces after eval
otel_file        --otel-file        string   none     Write OTLP JSON trace to file
import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    traceFile: '.agentv/results/trace-{timestamp}.jsonl',
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});

The {timestamp} placeholder is replaced with an ISO-like timestamp (e.g., 2026-03-05T14-30-00-000Z) at execution time.

Precedence: CLI flags > .agentv/config.yaml > agentv.config.ts > built-in defaults.
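A sketch of how those layers and the {timestamp} placeholder combine (illustrative only, not AgentV's actual implementation; option names mirror the table above):

```python
from datetime import datetime, timezone

def resolve(cli: dict, yaml_cfg: dict, ts_cfg: dict, defaults: dict) -> dict:
    """Merge option layers; later layers win, unset (None) values fall through."""
    merged = dict(defaults)
    for layer in (ts_cfg, yaml_cfg, cli):  # lowest to highest precedence
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

def expand_timestamp(path: str, now: datetime) -> str:
    """Replace {timestamp} with a filename-safe ISO-like stamp."""
    stamp = now.strftime("%Y-%m-%dT%H-%M-%S-") + f"{now.microsecond // 1000:03d}Z"
    return path.replace("{timestamp}", stamp)

opts = resolve(
    cli={"verbose": None},  # flag not passed on this invocation
    yaml_cfg={"verbose": True},
    ts_cfg={"trace_file": ".agentv/results/trace-{timestamp}.jsonl"},
    defaults={"verbose": False, "trace_file": None},
)
path = expand_timestamp(opts["trace_file"],
                        datetime(2026, 3, 5, 14, 30, 0, tzinfo=timezone.utc))
```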

Override the default ~/.agentv directory for all global runtime data (workspaces, git cache, subagents, trace state, version check cache):

Terminal window
# Linux/macOS
export AGENTV_HOME=/data/agentv
# Windows (PowerShell)
$env:AGENTV_HOME = "D:\agentv"
# Windows (CMD)
set AGENTV_HOME=D:\agentv

When set, AgentV logs Using AGENTV_HOME: <path> on startup to confirm the override is active.

Run agentv eval --help for the full list of options including workers, timeouts, output formats, and trace dumping.