
# amodal eval

Run evaluation suites against your agent to measure quality, compare models, and track regressions.

```bash
amodal eval
```

## Eval Files

Evals live in `evals/` as Markdown files:

```markdown
# Eval: Triage Accuracy

Test alert triage quality.

## Query

"Review recent security alerts"

## Assertions

- Should correctly identify critical alerts
- Should filter known false positives
- Should provide severity ranking
- Should use the request tool to query detection systems
- Should NOT fabricate alert data
```
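Because the file is plain Markdown, its sections are easy to read programmatically. The sketch below is not amodal's loader; it only illustrates how the Query and Assertions sections of a file like this could be extracted (the `Eval` container and the `load_eval` name are made up for the example):

```python
# Illustrative only: a minimal reader for the layout shown above.
# The amodal CLI has its own loader; this just mirrors the structure.
import re
from dataclasses import dataclass, field


@dataclass
class Eval:  # hypothetical container, not an amodal type
    title: str
    query: str = ""
    assertions: list[str] = field(default_factory=list)


def load_eval(text: str) -> Eval:
    """Pull the title, Query, and Assertions out of an eval Markdown file."""
    title = re.search(r"^# Eval:\s*(.+)$", text, re.M).group(1).strip()
    ev = Eval(title)

    # Split the file body on "## ..." section headings.
    for section in re.split(r"^##\s+", text, flags=re.M)[1:]:
        heading, _, body = section.partition("\n")
        if heading.strip().lower() == "query":
            ev.query = body.strip().strip('"')
        elif heading.strip().lower() == "assertions":
            ev.assertions = [
                line[2:].strip() for line in body.splitlines() if line.startswith("- ")
            ]
    return ev
```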

## Assertions

Assertions come in two types: deterministic checks that run instantly with no LLM call, and LLM-judged assertions that send the response to a judge model for evaluation. Both can be mixed freely in the same eval.

### Deterministic Assertions

Deterministic assertions use a `key: value` format. They are instant, free (no tokens), and fully reproducible.

Response text checks:

```markdown
- contains: "exact string"          # response includes this substring
- regex: "\\d{3}-\\d{4}"            # response matches regex pattern
- starts_with: "{"                  # response starts with this string
- length_between: [100, 5000]       # character count is in range
```

Tool usage checks:

```markdown
- tool_called: request              # agent called this tool
- tool_not_called: write_repo_file  # agent did NOT call this tool
```

Performance checks:

```markdown
- max_latency: 10000                # completed under N milliseconds
- max_turns: 3                      # agent loop completed in N turns
```

Negation: Prefix any assertion with `NOT` to invert it:

```markdown
- NOT contains: "#hashtag"
- NOT regex: "\\bTODO\\b"
```
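Since these checks are pure string operations, they need no model call at all. The Python sketch below is not amodal's implementation; it only illustrates how a few of the text checks above, plus the `NOT` prefix, could be evaluated against a response string (tool, latency, and turn checks would additionally need metadata about the run):

```python
import re


def check(assertion: str, response: str) -> bool:
    """Evaluate one deterministic text assertion against the response.

    Sketch only: amodal's real checker also covers tool usage, latency,
    and turn-count assertions, which need metadata about the run.
    """
    # A leading "NOT " inverts the underlying check.
    if assertion.startswith("NOT "):
        return not check(assertion[4:], response)

    key, _, value = assertion.partition(": ")
    value = value.strip().strip('"')

    if key == "contains":
        return value in response
    if key == "regex":
        return re.search(value, response) is not None
    if key == "starts_with":
        return response.startswith(value)
    if key == "length_between":
        low, high = (int(part) for part in value.strip("[]").split(","))
        return low <= len(response) <= high
    raise ValueError(f"not a deterministic text assertion: {assertion}")
```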

### LLM-Judged Assertions

Any assertion that does not match the `key: value` format is sent to an LLM judge. Use these for subjective quality checks that a regex cannot capture. They are slower and cost tokens, but can evaluate nuance.

```markdown
- Should explain the concept clearly
- Response is professional in tone
- Does not hallucinate data not present in the source
```
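Conceptually, judging an assertion means asking another model whether the response satisfies it and reading back a verdict. A rough sketch of that idea, assuming a hypothetical `complete()` callable that wraps whichever judge model you use (amodal manages the judge internally):

```python
# Illustrative judge prompt; not the prompt amodal actually uses.
JUDGE_PROMPT = """You are grading an AI agent's response.

Assertion: {assertion}

Agent response:
{response}

Does the response satisfy the assertion? Reply with PASS or FAIL,
followed by a one-sentence justification."""


def judge(assertion: str, response: str, complete) -> bool:
    """Ask a judge model whether the response satisfies the assertion.

    `complete` is a hypothetical callable (prompt string -> completion text)
    standing in for whichever LLM client the judge runs on.
    """
    verdict = complete(JUDGE_PROMPT.format(assertion=assertion, response=response))
    return verdict.strip().upper().startswith("PASS")
```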

### Mixing Both Types

Use deterministic assertions for structural requirements and LLM-judged assertions for quality:

````markdown
## Assertions

- contains: "```json"
- starts_with: "{"
- tool_called: request
- max_turns: 5
- Should explain the reasoning behind the result
- Should NOT include PII
````
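The split between the two types is purely syntactic: an assertion that parses as one of the known `key: value` checks runs deterministically, and everything else goes to the judge. A small illustration of that routing (the key set and function name are assumptions, not amodal internals):

```python
# Illustrative routing only; amodal applies its own rule internally.
DETERMINISTIC_KEYS = {
    "contains", "regex", "starts_with", "length_between",
    "tool_called", "tool_not_called", "max_latency", "max_turns",
}


def is_deterministic(assertion: str) -> bool:
    """True if the assertion uses a known key: value form (NOT prefix allowed)."""
    body = assertion.removeprefix("NOT ")
    key, sep, _ = body.partition(": ")
    return bool(sep) and key in DETERMINISTIC_KEYS


# In the example above, the first four assertions route to deterministic
# checks; the last two would be sent to the LLM judge.
```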

## Evaluation Methods

| Method | Description |
| --- | --- |
| Deterministic | Instant checks: string matching, regex, tool usage, latency, turn count |
| LLM Judge | An LLM evaluates the agent's response against plain-English assertions |
| Cost tracking | Track token usage and cost per eval |

## Experiments

Compare different configurations side-by-side:

```bash
amodal ops experiment
```

Experiments let you test:

- Different LLM providers or models
- Different skill configurations
- Different prompt variations
- Different knowledge documents

Results include cost comparison, quality scores, and latency metrics.
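The cost side of that comparison is straightforward token accounting: each run's input and output token counts multiplied by the provider's per-token price, summed over the suite. A sketch with placeholder prices (not real provider rates):

```python
# Placeholder per-million-token prices in USD; not real provider rates.
PRICES = {
    "provider-a": {"input": 3.00, "output": 15.00},
    "provider-b": {"input": 2.50, "output": 10.00},
}


def run_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single eval run, given its token counts."""
    price = PRICES[provider]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
```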

## Multi-Model Comparison

Run the same eval suite against multiple providers to find the best model for your use case:

```bash
amodal eval --providers anthropic,openai,google
```