amodal eval
Run evaluation suites against your agent to measure quality, compare models, and track regressions.
amodal eval
Eval Files
Evals live in evals/ as Markdown files:
# Eval: Triage Accuracy
Test alert triage quality.
## Query
"Review recent security alerts"
## Assertions
- Should correctly identify critical alerts
- Should filter known false positives
- Should provide severity ranking
- Should use the request tool to query detection systems
- Should NOT fabricate alert data
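As a rough mental model, each eval file boils down to a name, a query, and a list of assertion strings. The sketch below is a hypothetical parse of the file shape shown above, not how amodal actually loads evals (the `EvalCase` structure is assumed for illustration):

```python
import re
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    name: str
    query: str
    assertions: list[str] = field(default_factory=list)

def parse_eval(markdown: str) -> EvalCase:
    """Pull the eval title, the ## Query text, and the ## Assertions bullets."""
    name = re.search(r"^# Eval:\s*(.+)$", markdown, re.M).group(1).strip()
    query = re.search(r"## Query\s+\"?(.+?)\"?\s*$", markdown, re.M).group(1)
    bullets = markdown.split("## Assertions", 1)[1]
    assertions = re.findall(r"^- (.+)$", bullets, re.M)
    return EvalCase(name=name, query=query, assertions=assertions)
```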
Assertions
Assertions come in two types: deterministic checks that run instantly with no LLM call, and LLM-judged assertions that send the response to a judge model for evaluation. Both can be mixed freely in the same eval.
Deterministic Assertions
Deterministic assertions use a key: value format. They are instant, free (no tokens), and fully reproducible.
- contains: "exact string" # response includes this substring
- regex: "\\d{3}-\\d{4}" # response matches regex pattern
- starts_with: "{" # response starts with this string
- length_between: [100, 5000] # character count is in range
- tool_called: request # agent called this tool
- tool_not_called: write_repo_file # agent did NOT call this tool
- max_latency: 10000 # completed under N milliseconds
- max_turns: 3 # agent loop completed in N turns
Negation: Prefix any assertion with NOT to invert it:
- NOT contains: "#hashtag"
- NOT regex: "\\bTODO\\b"
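To make it concrete why these checks are instant, free, and reproducible, here is a minimal sketch of how a deterministic assertion could be evaluated in plain code. This is a hypothetical illustration, not amodal's implementation; the run data (response text, tools called, latency, turn count) is assumed to be available from the agent run:

```python
import json
import re

def check(assertion: str, response: str, tools_called: list[str],
          latency_ms: float, turns: int) -> bool:
    """Evaluate one deterministic `key: value` assertion against an agent run."""
    negate = assertion.startswith("NOT ")
    key, _, value = assertion.removeprefix("NOT ").partition(": ")
    value = value.split(" #", 1)[0].strip().strip('"')  # drop inline comment and quotes
    checks = {
        "contains":        lambda: value in response,
        "regex":           lambda: re.search(value, response) is not None,
        "starts_with":     lambda: response.startswith(value),
        "length_between":  lambda: json.loads(value)[0] <= len(response) <= json.loads(value)[1],
        "tool_called":     lambda: value in tools_called,
        "tool_not_called": lambda: value not in tools_called,
        "max_latency":     lambda: latency_ms <= float(value),
        "max_turns":       lambda: turns <= int(value),
    }
    passed = checks[key]()
    return not passed if negate else passed
```

A full run simply maps a check like this over every key: value assertion and records pass/fail, with no model call involved.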
LLM-Judged Assertions
Any assertion that does not match the key: value format is sent to an LLM judge. Use these for subjective quality checks that a regex cannot capture. They are slower and cost tokens, but can evaluate nuance.
- Should explain the concept clearly
- Response is professional in tone
- Does not hallucinate data not present in the source
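Conceptually, a judged assertion packages the assertion and the agent's response into a grading prompt and reads back a verdict. A minimal sketch of that flow, with the model call abstracted as a callable you supply (a hypothetical helper, not amodal's API):

```python
from typing import Callable

def judge(assertion: str, response: str, complete: Callable[[str], str]) -> bool:
    """Ask a judge model whether the response satisfies a plain-English assertion.

    `complete` is whatever function sends a prompt to your judge model and
    returns its text reply.
    """
    prompt = (
        "You are grading an AI agent's response against an assertion.\n"
        f"Assertion: {assertion}\n"
        f"Response:\n{response}\n\n"
        "Reply with exactly PASS or FAIL."
    )
    return complete(prompt).strip().upper().startswith("PASS")
```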
Mixing Both Types
Use deterministic assertions for structural requirements and LLM-judged assertions for quality:
## Assertions
- contains: "```json"
- starts_with: "{"
- tool_called: request
- max_turns: 5
- Should explain the reasoning behind the result
- Should NOT include PII
Evaluation Methods
| Method | Description |
|---|---|
| Deterministic | Instant checks — string matching, regex, tool usage, latency, turn count |
| LLM Judge | An LLM evaluates the agent's response against plain-English assertions |
| Cost tracking | Track token usage and cost per eval (see the sketch below) |
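Per-eval cost is generally a function of token counts and the provider's per-token pricing. A rough sketch with illustrative placeholder rates (not real provider prices):

```python
def eval_cost(prompt_tokens: int, completion_tokens: int,
              input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one eval run from token counts and per-1K-token prices."""
    return (prompt_tokens / 1000) * input_price_per_1k + \
           (completion_tokens / 1000) * output_price_per_1k

# e.g. 4,200 prompt tokens and 900 completion tokens at hypothetical
# $0.003 / $0.015 per 1K tokens comes to about $0.026 for the run
print(round(eval_cost(4200, 900, 0.003, 0.015), 4))  # 0.0261
```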
Experiments
Compare different configurations side-by-side:
amodal ops experiment
Experiments let you test:
- Different LLM providers or models
- Different skill configurations
- Different prompt variations
- Different knowledge documents
Results include cost comparison, quality scores, and latency metrics.
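The exact report format depends on your setup, but conceptually each configuration's runs are rolled up into comparable numbers. A hypothetical aggregation sketch (the per-run result fields here are assumed for illustration, not amodal's schema):

```python
from statistics import mean, median

def summarize(runs: list[dict]) -> dict:
    """Roll up one configuration's eval runs into a single comparison row."""
    return {
        "pass_rate": mean(1 if r["passed"] else 0 for r in runs),
        "avg_cost_usd": mean(r["cost_usd"] for r in runs),
        "median_latency_ms": median(r["latency_ms"] for r in runs),
    }

# One row per configuration (provider, model, prompt variant, etc.)
# makes the side-by-side comparison straightforward to read.
```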
Multi-Model Comparison
Run the same eval suite against multiple providers to find the best model for your use case:
amodal eval --providers anthropic,openai,google