
# Evals

Evals live in `evals/` as Markdown files. Each eval defines a query, setup context, and assertions that measure agent quality.

## Eval File Format

```markdown
# Eval: Revenue Drop Investigation

Tests the agent's ability to investigate a revenue anomaly.

## Setup

Tenant: test-tenant
Context: Revenue dropped 30% yesterday compared to the weekly average.

## Query

"Revenue was down 30% yesterday. What happened?"

## Assertions

- Should query Stripe charges for the relevant time period
- Should compare against baseline or previous period
- Should check for known issues (billing cycle, deployments, timezone effects)
- Should provide specific numbers (dollar amounts, percentages)
- Should NOT fabricate data or guess without querying
- Should NOT blame external factors without evidence
```

## Parsed Fields

| Field | Source | Description |
| --- | --- | --- |
| `name` | Filename without `.md` | Eval identifier |
| `title` | `# Eval:` title heading | Display name |
| `description` | Text between the heading and the first `##` | What the eval tests |
| `setup.tenant` | `Tenant:` line in `## Setup` | Tenant to use |
| `setup.context` | `Context:` line in `## Setup` | Background context |
| `query` | Content of `## Query` (without quotes) | The user message to test |
| `assertions` | `## Assertions` list items | Quality criteria |

## Assertions

Lines starting with `- Should` are positive assertions: things the agent must do.

Lines starting with `- Should NOT` are negated assertions: things the agent must avoid.
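Separating the two kinds is a one-pass prefix check; a minimal sketch (function name is illustrative):

```python
def classify_assertions(assertions: list[str]) -> tuple[list[str], list[str]]:
    """Split assertion texts into (must_do, must_avoid).
    Checks "Should NOT" before the more general "Should" prefix."""
    must_do, must_avoid = [], []
    for text in assertions:
        if text.startswith("Should NOT"):
            must_avoid.append(text)
        elif text.startswith("Should"):
            must_do.append(text)
    return must_do, must_avoid
```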

## Running Evals

```shell
amodal eval                              # run all evals
amodal eval --file revenue-drop.md       # run one eval
amodal eval --providers anthropic,openai # compare providers
```

## Evaluation Methods

| Method | Description |
| --- | --- |
| LLM judge | A separate LLM evaluates the response against each assertion |
| Tool usage | Verifies that the expected tools were called |
| Cost tracking | Records token usage and cost per eval case |
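The LLM-judge method amounts to one grading call per assertion. The sketch below is provider-agnostic; `ask_llm` is a hypothetical callable standing in for any completion API, not part of amodal:

```python
def judge_response(response: str, assertions: list[str], ask_llm) -> dict:
    """Grade a response against each assertion with a separate LLM call.

    `ask_llm(prompt) -> str` is a placeholder for any completion API;
    it is expected to answer "PASS" or "FAIL".
    """
    results = {}
    for assertion in assertions:
        prompt = (
            "You are grading an AI agent's response.\n"
            f"Assertion: {assertion}\n"
            f"Response:\n{response}\n"
            "Answer PASS or FAIL."
        )
        results[assertion] = ask_llm(prompt).strip().upper().startswith("PASS")
    score = sum(results.values()) / len(results) if results else 0.0
    return {"results": results, "score": score}
```

Grading each assertion in its own call keeps verdicts independent, at the cost of more judge tokens per eval case.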

## Experiments

Compare configurations side-by-side:

```shell
amodal experiment
```

Test different models, skills, knowledge docs, or prompts. Results include quality scores, costs, and latency.

## Multi-Model Comparison

```shell
amodal eval --providers anthropic,openai,google
```

Runs the same suite against each provider for cost/quality/latency comparison.
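Reducing per-provider runs to that comparison is a simple aggregation. A sketch assuming each run records `score`, `cost_usd`, and `latency_s` (these field names are assumptions, not amodal's schema):

```python
def compare_providers(runs: dict[str, list[dict]]) -> dict[str, dict]:
    """Aggregate eval runs per provider into mean score, total cost,
    and mean latency. Assumes at least one run per provider."""
    summary = {}
    for provider, results in runs.items():
        n = len(results)
        summary[provider] = {
            "mean_score": sum(r["score"] for r in results) / n,
            "total_cost_usd": sum(r["cost_usd"] for r in results),
            "mean_latency_s": sum(r["latency_s"] for r in results) / n,
        }
    return summary
```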

## Platform Integration

Results can be sent to the platform API for trend tracking, baseline comparison, and run history.