Eval Files

Evaluation files define the test cases, targets, and evaluators for an evaluation run. AgentV supports two formats: YAML and JSONL.

YAML is the primary format. A single file contains metadata, execution config, and tests:

```yaml
description: Math problem solving evaluation
execution:
  target: default
assertions:
  - name: correctness
    type: llm-grader
    prompt: ./graders/correctness.md
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```
| Field | Description |
| --- | --- |
| `description` | Human-readable description of the evaluation |
| `dataset` | Optional dataset identifier |
| `execution` | Default execution config (`target`, `fail_on_error`, etc.) |
| `workspace` | Suite-level workspace config: an inline object or a string path to an external workspace file |
| `tests` | Array of individual tests, or a string path to an external file |
| `assert` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
| `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |

You can add structured metadata to your eval file using these optional top-level fields. Metadata is parsed when the name field is present:

| Field | Description |
| --- | --- |
| `name` | Machine-readable identifier (lowercase, hyphens, max 64 chars). Triggers metadata parsing. |
| `description` | Human-readable description (max 1024 chars) |
| `version` | Eval version string (e.g., `"1.0"`) |
| `author` | Author or team identifier |
| `tags` | Array of string tags for categorization |
| `license` | License identifier (e.g., `"MIT"`, `"Apache-2.0"`) |
| `requires` | Dependency constraints (e.g., `agentv: ">=0.30.0"`) |
```yaml
name: export-screening
description: Evaluates export control screening accuracy
version: "1.0"
author: acme-compliance
tags: [compliance, agents]
license: Apache-2.0
requires:
  agentv: ">=0.30.0"
tests:
  - id: denied-party
    criteria: Identifies denied parties correctly
    input: Screen "Acme Corp" against denied parties list
```

The assert field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test’s evaluators unless a test sets execution.skip_defaults: true.

```yaml
description: API response validation
assert:
  - type: is-json
    required: true
  - type: contains
    value: "status"
tests:
  - id: health-check
    criteria: Returns health status
    input: Check API health
```

assert supports all evaluator types, including deterministic assertion types (contains, regex, is-json, equals) and rubrics. See Tests for per-test assert usage.
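The append-unless-skipped rule can be sketched as follows. This is an illustrative model of the documented behavior, not AgentV's actual implementation; the function name and dict shapes are assumptions:

```python
# Illustrative model of suite-level evaluator merging (not AgentV source).
# Suite-level assert entries are appended after each test's own evaluators,
# unless the test opts out with execution.skip_defaults: true.
def effective_evaluators(suite_assert, test):
    if test.get("execution", {}).get("skip_defaults"):
        return list(test.get("assert", []))
    return list(test.get("assert", [])) + list(suite_assert)
```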

The input field defines messages that are prepended to every test’s input. This avoids repeating the same prompt or system context in each test case — following the same pattern as suite-level assert.

```yaml
description: Travel assistant evaluation
input:
  - role: user
    content:
      - type: file
        value: ./system-prompt.md
tests: ./cases.yaml
```

Each test in cases.yaml only needs its own query:

```yaml
- id: japan-spring
  criteria: Recommends spring for cherry blossoms
  input: When is the best time to visit Japan?
```

The effective input at runtime becomes [...suite input, ...test input].

Suite-level input accepts the same formats as test-level input:

  • String — wrapped as [{ role: "user", content: "..." }]
  • Message array — used as-is, including file references

To opt out for a specific test, set execution.skip_defaults: true (same flag that skips suite-level assert).
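Taken together, the runtime merge can be modeled roughly as below. This is an illustrative sketch of the documented semantics, not AgentV internals; the function names are assumptions:

```python
# Illustrative model of suite-level input merging (not AgentV source).
def normalize(inp):
    # A string input is wrapped as a single user message.
    if isinstance(inp, str):
        return [{"role": "user", "content": inp}]
    return list(inp)  # a message array is used as-is

def effective_input(suite_input, test):
    messages = normalize(test["input"])
    if test.get("execution", {}).get("skip_defaults"):
        return messages  # opted out of suite-level input
    return normalize(suite_input) + messages  # [...suite input, ...test input]
```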

The input_files field provides a shorthand for attaching shared file references to every test. When a test has a string input, the suite-level files are prepended as type: file content blocks in a single user message — the same shape produced by per-test input_files.

```yaml
description: Schema review evaluation
input_files:
  - ./shared-context.md
  - ./schema.json
tests:
  - id: summarize
    criteria: Summarizes the important constraints
    input: Summarize the important constraints.
  - id: validate
    criteria: Identifies validation gaps
    input: What validation is missing?
```

Each test’s effective input becomes a single user message with [file blocks..., text block].

Per-test input_files overrides the suite-level value (it does not merge). To opt out, set execution.skip_defaults: true on the test.
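The expansion described above can be sketched as a small helper. This is an assumed model of the documented message shape, not AgentV's code:

```python
# Illustrative model of the documented input_files expansion (not AgentV
# source): shared files become type: file blocks, followed by the test's
# string input as a text block, all inside a single user message.
def expand_input_files(files, text_input):
    blocks = [{"type": "file", "value": path} for path in files]
    blocks.append({"type": "text", "value": text_input})
    return {"role": "user", "content": blocks}
```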

Instead of inlining tests in the same file, you can point tests to an external YAML or JSONL file. This is the inverse of the sidecar pattern — the metadata file references the test data:

```yaml
name: my-eval
description: My evaluation suite
execution:
  target: default
tests: ./cases.yaml
```

The path is resolved relative to the eval file's directory. The external file should contain either a YAML array of test objects or, for JSONL, one JSON test object per line.

All string fields in eval files support ${{ VAR }} syntax for environment variable interpolation. This enables portable eval configs that work across machines and CI environments without hardcoded paths.

```yaml
workspace:
  repos:
    - path: ./RepoA
      source:
        type: local
        path: "${{ REPO_A_PATH }}"
tests:
  - id: test-1
    input: "Evaluate the code in ${{ PROJECT_NAME }}"
    criteria: "${{ EVAL_CRITERIA }}"
```
  • Syntax: ${{ VARIABLE_NAME }} with optional whitespace around the name
  • Missing variables resolve to an empty string
  • Partial interpolation is supported: ${{ HOME }}/repos/${{ PROJECT }} becomes /home/user/repos/myproject
  • Non-string values (numbers, booleans) are not affected
  • Interpolation is applied recursively to all nested objects and arrays
  • Works in YAML eval files, external YAML/JSONL case files, and external workspace config files
  • .env files in the directory hierarchy are loaded automatically before interpolation
```yaml
# workspace.yaml - works on any machine
repos:
  - path: ./my-repo
    source:
      type: local
      path: "${{ MY_REPO_LOCAL_PATH }}"
```

.env:

```
MY_REPO_LOCAL_PATH=/home/dev/repos/my-repo
```
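The interpolation rules above can be sketched in a few lines of Python. This is an assumed model of the documented behavior, not AgentV's implementation:

```python
import os
import re

# Minimal sketch of ${{ VAR }} interpolation with the properties listed
# above (assumed semantics, not AgentV source): missing variables become
# empty strings, several variables may appear in one string, interpolation
# recurses through nested objects and arrays, and non-string values pass
# through untouched.
PATTERN = re.compile(r"\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def interpolate(value, env=None):
    env = os.environ if env is None else env
    if isinstance(value, str):
        return PATTERN.sub(lambda m: env.get(m.group(1), ""), value)
    if isinstance(value, list):
        return [interpolate(v, env) for v in value]
    if isinstance(value, dict):
        return {k: interpolate(v, env) for k, v in value.items()}
    return value  # numbers, booleans, None are left as-is
```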

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format. Each line is a single test:

```jsonl
{"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
{"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}
```

An optional YAML sidecar file provides metadata and execution config. Place it alongside the JSONL file with the same base name:

dataset.jsonl + dataset.eval.yaml:

```yaml
description: Math evaluation dataset
dataset: math-tests
execution:
  target: azure-base
assertions:
  - name: correctness
    type: llm-grader
    prompt: ./graders/correctness.md
```
  • Streaming-friendly — process line by line
  • Git-friendly — diffs show individual case changes
  • Programmatic generation — easy to create from scripts
  • Industry standard — compatible with DeepEval, LangWatch, Hugging Face datasets
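Programmatic generation takes only a few lines; here is a minimal Python sketch (the filename and test fields mirror the examples above):

```python
import json

# Generate a JSONL eval dataset from a script: one JSON test object per line.
tests = [
    {"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"},
    {"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"},
]
jsonl = "\n".join(json.dumps(t) for t in tests) + "\n"
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(jsonl)
```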

Use the convert command to switch between YAML and JSONL:

```sh
agentv convert evals/dataset.eval.yaml --format jsonl
agentv convert evals/dataset.jsonl --format yaml
```