Running Evaluations

Run an eval file:

Terminal window
agentv eval evals/my-eval.yaml

Results are written to .agentv/results/eval_<timestamp>.jsonl.
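
For example, you can find the most recent results file with ordinary shell tools (nothing agentv-specific here):

Terminal window
# newest results file first
ls -t .agentv/results/eval_*.jsonl | head -n 1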

Run against a different target than the one specified in the eval file:

Terminal window
agentv eval --target azure_base evals/**/*.yaml

Run a single test by ID:

Terminal window
agentv eval --test-id case-123 evals/my-eval.yaml

Test the harness flow with mock responses (does not call real providers):

Terminal window
agentv eval --dry-run evals/my-eval.yaml

Write results to a custom output path instead of the default location:

Terminal window
agentv eval evals/my-eval.yaml --out results/baseline.jsonl

Export execution traces (tool calls, timing, spans) to files for debugging and analysis:

Terminal window
# Human-readable JSONL trace (one record per test case)
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl
# OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json
# Both formats simultaneously
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl --otel-file traces/eval.otlp.json

The --trace-file format writes JSONL records containing:

  • test_id - The test identifier
  • target / score - Target and evaluation score
  • duration_ms - Total execution duration
  • spans - Array of tool invocations with timing
  • token_usage / cost_usd - Resource consumption
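
For illustration, one such record might look roughly like this; the field names follow the list above, while the values and the exact nesting of spans and token_usage are invented for the example:

{
  "test_id": "case-123",
  "target": "azure_base",
  "score": 0.8,
  "duration_ms": 4215,
  "spans": [{ "name": "tool:search", "duration_ms": 820 }],
  "token_usage": { "input_tokens": 1530, "output_tokens": 412 },
  "cost_usd": 0.0042
}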

The --otel-file format writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.

When using workspace_template or the workspace config block, temporary workspaces are created for each test. By default, workspaces are cleaned up on success and preserved on failure for debugging.

Terminal window
# Always keep workspaces (for debugging)
agentv eval evals/my-eval.yaml --keep-workspaces
# Always cleanup workspaces (even on failure)
agentv eval evals/my-eval.yaml --cleanup-workspaces

Workspaces are stored at ~/.agentv/workspaces/<eval-run-id>/<test-id>/.
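
After a failure, you can inspect a preserved workspace directly with plain shell tools, substituting the run and test IDs from the output:

Terminal window
ls ~/.agentv/workspaces/<eval-run-id>/<test-id>/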

Re-run only the tests that had infrastructure/execution errors from a previous output:

Terminal window
agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl

This reads the previous JSONL, filters for executionStatus === 'execution_error', and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
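
To preview which records would be retried, you can apply the same filter yourself, for example with jq (field name as described above):

Terminal window
jq -c 'select(.executionStatus == "execution_error")' .agentv/results/eval_previous.jsonl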

Control whether the eval run halts on execution errors using execution.fail_on_error in the eval YAML:

execution:
  fail_on_error: false # never halt on errors (default)
  # fail_on_error: true # halt on first execution error

Value   Behavior
true    Halt immediately on first execution error
false   Continue despite errors (default)

When halted, remaining tests are recorded with failureReasonCode: 'error_threshold_exceeded'. With concurrency > 1, a few additional tests may complete before halting takes effect.

Check eval files for schema errors without executing:

Terminal window
agentv validate evals/my-eval.yaml

Run evaluations without API keys by letting an external agent (e.g., Claude Code, Copilot CLI) orchestrate the eval pipeline.

Terminal window
agentv prompt eval evals/my-eval.yaml

Outputs a step-by-step orchestration prompt listing all tests and the commands to run for each.

Get the prepared input for a single test:

Terminal window
agentv prompt eval input evals/my-eval.yaml --test-id case-123

Returns JSON with:

  • input — [{role, content}] array. File references use absolute paths ({type: "file", path: "/abs/path"}) that the agent can read directly from the filesystem.
  • guideline_paths — files containing additional instructions to prepend to the system message.
  • criteria — grading criteria for the orchestrator’s reference (do not pass to the candidate).
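
As a rough sketch, the returned JSON has this shape; only the three fields above are documented, and the exact nesting of message content and file references here is illustrative:

{
  "input": [
    { "role": "system", "content": "..." },
    { "role": "user", "content": [{ "type": "file", "path": "/abs/path/to/task.md" }] }
  ],
  "guideline_paths": ["/abs/path/to/guidelines.md"],
  "criteria": "..."
}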

Grade a candidate answer against the test's evaluators:

Terminal window
agentv prompt eval judge evals/my-eval.yaml --test-id case-123 --answer-file response.txt

Runs code judges deterministically and returns LLM judge prompts for the agent to execute. Each evaluator in the output has a status:

  • "completed" — Score is final (e.g., code judge). Read result.score.
  • "prompt_ready" — LLM grading required. Send prompt.system_prompt and prompt.user_prompt to your LLM and parse the JSON response.

Scenario                                    Command
Have API keys, want end-to-end automation   agentv eval
No API keys, agent can act as the LLM       agentv prompt

Declare the minimum AgentV version needed by your eval project in .agentv/config.yaml:

required_version: ">=2.12.0"

The value is a semver range using standard npm syntax (e.g., >=2.12.0, ^2.12.0, ~2.12, >=2.12.0 <3.0.0).

Condition                  Interactive (TTY)            Non-interactive (CI)
Version satisfies range    Runs silently                Runs silently
Version below range        Warns + prompts to continue  Warns to stderr, continues
--strict flag + mismatch   Warns + exits 1              Warns + exits 1
No required_version set    Runs silently                Runs silently
Malformed semver range     Error + exits 1              Error + exits 1

Use --strict in CI pipelines to enforce version requirements:

Terminal window
agentv eval --strict evals/my-eval.yaml
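
For example, a CI step might look like the following; the GitHub Actions wiring is illustrative, only the agentv command itself comes from this page:

- name: Run evals with strict version check
  run: agentv eval --strict evals/**/*.yaml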

Set default execution options so you don’t have to pass them on every CLI invocation. Both .agentv/config.yaml and agentv.config.ts are supported.

In .agentv/config.yaml:

execution:
  verbose: true
  trace_file: .agentv/results/trace-{timestamp}.jsonl
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json

Field            CLI equivalent     Type     Default  Description
verbose          --verbose          boolean  false    Enable verbose logging
trace_file       --trace-file       string   none     Write human-readable trace JSONL
keep_workspaces  --keep-workspaces  boolean  false    Always keep temp workspaces after eval
otel_file        --otel-file        string   none     Write OTLP JSON trace to file

The equivalent in agentv.config.ts:

import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    traceFile: '.agentv/results/trace-{timestamp}.jsonl',
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});

The {timestamp} placeholder is replaced with an ISO-like timestamp (e.g., 2026-03-05T14-30-00-000Z) at execution time.

Precedence: CLI flags > .agentv/config.yaml > agentv.config.ts > built-in defaults.

Run agentv eval --help for the full list of options including workers, timeouts, output formats, and trace dumping.