# Evals

Evals live in `evals/` as Markdown files. Each eval defines a query, setup context, and assertions that measure agent quality.
## Eval File Format
```markdown
# Eval: Revenue Drop Investigation

Tests the agent's ability to investigate a revenue anomaly.

## Setup

Tenant: test-tenant
Context: Revenue dropped 30% yesterday compared to the weekly average.

## Query

"Revenue was down 30% yesterday. What happened?"

## Assertions

- Should query Stripe charges for the relevant time period
- Should compare against baseline or previous period
- Should check for known issues (billing cycle, deployments, timezone effects)
- Should provide specific numbers (dollar amounts, percentages)
- Should NOT fabricate data or guess without querying
- Should NOT blame external factors without evidence
```

## Parsed Fields
| Field | Source | Description |
|---|---|---|
| `name` | Filename without `.md` | Eval identifier |
| `title` | `# Eval: Title` heading | Display name |
| `description` | Text between heading and first `##` | What the eval tests |
| `setup.tenant` | `Tenant:` line in `## Setup` | Tenant to use |
| `setup.context` | `Context:` line in `## Setup` | Background context |
| `query` | Content of `## Query` (without quotes) | The user message to test |
| `assertions` | `## Assertions` list items | Quality criteria |
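The doc does not show `amodal`'s actual parser, but the extraction rules in the table can be sketched in a few lines of Python. The function and class names here are illustrative, not part of the tool's API:

```python
import re
from dataclasses import dataclass, field

@dataclass
class Eval:
    name: str
    title: str = ""
    description: str = ""
    tenant: str = ""
    context: str = ""
    query: str = ""
    assertions: list = field(default_factory=list)

def parse_eval(name: str, text: str) -> Eval:
    """Parse one eval Markdown file into its fields (sketch)."""
    ev = Eval(name=name)
    # Everything before the first "## " heading holds the title and description.
    head, *sections = re.split(r"^## ", text, flags=re.M)
    for line in head.splitlines():
        if line.startswith("# Eval:"):
            ev.title = line[len("# Eval:"):].strip()
        elif line.strip():
            ev.description += line.strip() + " "
    ev.description = ev.description.strip()
    for sec in sections:
        header, _, body = sec.partition("\n")
        header, body = header.strip().lower(), body.strip()
        if header == "setup":
            for line in body.splitlines():
                key, _, val = line.partition(":")
                if key.strip() == "Tenant":
                    ev.tenant = val.strip()
                elif key.strip() == "Context":
                    ev.context = val.strip()
        elif header == "query":
            ev.query = body.strip('"')  # quotes around the query are dropped
        elif header == "assertions":
            ev.assertions = [l.lstrip("- ").strip()
                             for l in body.splitlines() if l.startswith("- ")]
    return ev
```

The `name` comes from the filename (minus `.md`), so the caller passes it in rather than the parser deriving it.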
## Assertions

Lines starting with `- Should` are positive assertions: things the agent must do.
Lines starting with `- Should NOT` are negated assertions: things the agent must avoid.
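The positive/negated split could be implemented as a small helper; `classify_assertion` is a hypothetical name, not part of the tool:

```python
def classify_assertion(line: str) -> tuple[bool, str]:
    """Split a '- Should ...' list item into (negated, criterion)."""
    text = line.lstrip("- ").strip()
    # Check the longer "Should NOT " prefix first so it isn't
    # swallowed by the plain "Should " case.
    if text.startswith("Should NOT "):
        return True, text[len("Should NOT "):]
    if text.startswith("Should "):
        return False, text[len("Should "):]
    raise ValueError(f"not an assertion line: {line!r}")
```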
## Running Evals
```shell
amodal eval                                # run all evals
amodal eval --file revenue-drop.md         # run one eval
amodal eval --providers anthropic,openai   # compare providers
```

## Evaluation Methods
| Method | Description |
|---|---|
| LLM Judge | A separate LLM evaluates the response against each assertion |
| Tool usage | Verify expected tools were called |
| Cost tracking | Token usage and cost per eval case |
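As an illustration of the cost-tracking method, a per-case cost can be derived from token counts and per-million-token prices. The prices below are placeholders, not real provider pricing, and the function name is hypothetical:

```python
# Placeholder per-million-token prices in USD; real prices vary by
# model and change over time.
PRICE_PER_MTOK = {
    "anthropic": {"input": 3.00, "output": 15.00},
    "openai":    {"input": 2.50, "output": 10.00},
}

def eval_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one eval case, from its token usage."""
    p = PRICE_PER_MTOK[provider]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```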
## Experiments

Compare configurations side-by-side:

```shell
amodal experiment
```

Test different models, skills, knowledge docs, or prompts. Results include quality scores, costs, and latency.
## Multi-Model Comparison

```shell
amodal eval --providers anthropic,openai,google
```

Runs the same suite against each provider for cost/quality/latency comparison.
## Platform Integration
Results can be sent to the platform API for trend tracking, baseline comparison, and run history.
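The platform API's schema is not documented here, so as a sketch only, an upload payload for one run might bundle per-case results with a summary; every field name below is an assumption:

```python
def build_run_payload(run_id: str, provider: str, results: list[dict]) -> dict:
    """Shape one eval run for upload (illustrative schema, not the real API).

    Each item in `results` is assumed to carry at least a boolean "passed".
    """
    passed = sum(1 for r in results if r["passed"])
    return {
        "run_id": run_id,
        "provider": provider,
        "cases": results,
        "summary": {
            "total": len(results),
            "passed": passed,
            # Guard against division by zero on an empty run.
            "pass_rate": passed / len(results) if results else 0.0,
        },
    }
```

Summarizing pass counts client-side lets the platform compute trends and baseline deltas without re-reading every case.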