amodal eval
Run evaluation suites against your agent to measure quality, compare models, and track regressions.
amodal eval
Eval Files
Evals live in evals/ as Markdown files:
# Eval: Triage Accuracy
Test alert triage quality.
## Query
"Review recent security alerts"
## Assertions
- Should correctly identify critical alerts
- Should filter known false positives
- Should provide severity ranking
- Should use the request tool to query detection systems
- Should NOT fabricate alert data
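As a rough mental model, each eval file boils down to a name, a query, and a list of assertion strings. The sketch below is a hypothetical parse of the file shape shown above, not how amodal actually loads evals (the `EvalCase` structure is assumed for illustration):

```python
import re
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    name: str
    query: str
    assertions: list[str] = field(default_factory=list)

def parse_eval(markdown: str) -> EvalCase:
    """Pull the eval title, the ## Query text, and the ## Assertions bullets."""
    name = re.search(r"^# Eval:\s*(.+)$", markdown, re.M).group(1).strip()
    query = re.search(r"## Query\s+\"?(.+?)\"?\s*$", markdown, re.M).group(1)
    bullets = markdown.split("## Assertions", 1)[1]
    assertions = re.findall(r"^- (.+)$", bullets, re.M)
    return EvalCase(name=name, query=query, assertions=assertions)
```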
Assertions
Assertions come in two types: deterministic checks that run instantly with no LLM call, and LLM-judged assertions that send the response to a judge model for evaluation. Both can be mixed freely in the same eval.
Deterministic Assertions
Deterministic assertions use a key: value format. They are instant, free (no tokens), and fully reproducible.
- contains: "exact string" # response includes this substring
- regex: "\\d{3}-\\d{4}" # response matches regex pattern
- starts_with: "{" # response starts with this string
- length_between: [100, 5000] # character count is in range
- tool_called: request # agent called this tool
- tool_not_called: write_repo_file # agent did NOT call this tool
- max_latency: 10000 # completed under N milliseconds
- max_turns: 3 # agent loop completed in N turns
Negation: Prefix any assertion with NOT to invert it:
- NOT contains: "#hashtag"
- NOT regex: "\\bTODO\\b"
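To make it concrete why these checks are instant, free, and reproducible, here is a minimal sketch of how a deterministic assertion could be evaluated in plain code. This is a hypothetical illustration, not amodal's implementation; the run data (response text, tools called, latency, turn count) is assumed to be available from the agent run:

```python
import json
import re

def check(assertion: str, response: str, tools_called: list[str],
          latency_ms: float, turns: int) -> bool:
    """Evaluate one deterministic `key: value` assertion against an agent run."""
    negate = assertion.startswith("NOT ")
    key, _, value = assertion.removeprefix("NOT ").partition(": ")
    value = value.split(" #", 1)[0].strip().strip('"')  # drop inline comment and quotes
    checks = {
        "contains":        lambda: value in response,
        "regex":           lambda: re.search(value, response) is not None,
        "starts_with":     lambda: response.startswith(value),
        "length_between":  lambda: json.loads(value)[0] <= len(response) <= json.loads(value)[1],
        "tool_called":     lambda: value in tools_called,
        "tool_not_called": lambda: value not in tools_called,
        "max_latency":     lambda: latency_ms <= float(value),
        "max_turns":       lambda: turns <= int(value),
    }
    passed = checks[key]()
    return not passed if negate else passed
```

A full run simply maps a check like this over every key: value assertion and records pass/fail, with no model call involved.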
LLM-Judged Assertions
Any assertion that does not match the key: value format is sent to an LLM judge. Use these for subjective quality checks that a regex cannot capture. They are slower and cost tokens, but can evaluate nuance.
- Should explain the concept clearly
- Response is professional in tone
- Does not hallucinate data not present in the source
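Conceptually, a judged assertion packages the assertion and the agent's response into a grading prompt and reads back a verdict. A minimal sketch of that flow, with the model call abstracted as a callable you supply (a hypothetical helper, not amodal's API):

```python
from typing import Callable

def judge(assertion: str, response: str, complete: Callable[[str], str]) -> bool:
    """Ask a judge model whether the response satisfies a plain-English assertion.

    `complete` is whatever function sends a prompt to your judge model and
    returns its text reply.
    """
    prompt = (
        "You are grading an AI agent's response against an assertion.\n"
        f"Assertion: {assertion}\n"
        f"Response:\n{response}\n\n"
        "Reply with exactly PASS or FAIL."
    )
    return complete(prompt).strip().upper().startswith("PASS")
```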
Mixing Both Types
Use deterministic assertions for structural requirements and LLM-judged assertions for quality:
## Assertions
- contains: "```json"
- starts_with: "{"
- tool_called: request
- max_turns: 5
- Should explain the reasoning behind the result
- Should NOT include PII
Evaluation Methods
| Method | Description |
|---|---|
| Deterministic | Instant checks — string matching, regex, tool usage, latency, turn count |
| LLM Judge | An LLM evaluates the agent's response against plain-English assertions |
| Cost tracking | Track token usage and cost per eval (see the sketch below) |
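Per-eval cost is generally a function of token counts and the provider's per-token pricing. A rough sketch with illustrative placeholder rates (not real provider prices):

```python
def eval_cost(prompt_tokens: int, completion_tokens: int,
              input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one eval run from token counts and per-1K-token prices."""
    return (prompt_tokens / 1000) * input_price_per_1k + \
           (completion_tokens / 1000) * output_price_per_1k

# e.g. 4,200 prompt tokens and 900 completion tokens at hypothetical
# $0.003 / $0.015 per 1K tokens comes to about $0.026 for the run
print(round(eval_cost(4200, 900, 0.003, 0.015), 4))  # 0.0261
```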
Experiments
Compare different configurations side-by-side:
amodal ops experiment
Experiments let you test:
- Different LLM providers or models
- Different skill configurations
- Different prompt variations
- Different knowledge documents
Results include cost comparison, quality scores, and latency metrics.
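The exact report format depends on your setup, but conceptually each configuration's runs are rolled up into comparable numbers. A hypothetical aggregation sketch (the per-run result fields here are assumed for illustration, not amodal's schema):

```python
from statistics import mean, median

def summarize(runs: list[dict]) -> dict:
    """Roll up one configuration's eval runs into a single comparison row."""
    return {
        "pass_rate": mean(1 if r["passed"] else 0 for r in runs),
        "avg_cost_usd": mean(r["cost_usd"] for r in runs),
        "median_latency_ms": median(r["latency_ms"] for r in runs),
    }

# One row per configuration (provider, model, prompt variant, etc.)
# makes the side-by-side comparison straightforward to read.
```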
Multi-Model Comparison
Run the same eval suite against multiple providers to find the best model for your use case:
amodal eval --providers anthropic,openai,google