amodal eval
Run evaluation suites against your agent to measure quality, compare models, and track regressions.
```bash
amodal eval
```

Eval Files
Evals live in .amodal/evals/ as YAML files:
```yaml
name: triage-accuracy
description: Test alert triage quality
cases:
  - input: "Review recent security alerts"
    rubric:
      - "Correctly identifies critical alerts"
      - "Filters known false positives"
      - "Provides severity ranking"
    expected_tools:
      - request
      - load_knowledge
```

Evaluation Methods
| Method | Description |
|---|---|
| LLM Judge | An LLM evaluates the agent's response against the rubric |
| Tool usage | Verifies that the expected tools were called |
| Cost tracking | Records token usage and cost for each eval case |
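In general, the LLM-judge method works by handing the agent's response and the rubric to a judge model and asking it to score each criterion. The sketch below illustrates that pattern only; it is not amodal's implementation, and `call_judge_model` is a stand-in for whatever LLM client you use.

```python
import json

def call_judge_model(prompt: str) -> str:
    """Stand-in for a real LLM client call; returns a canned verdict
    so the sketch runs end to end."""
    return json.dumps({
        "Correctly identifies critical alerts": True,
        "Filters known false positives": True,
        "Provides severity ranking": False,
    })

def judge_response(response: str, rubric: list[str]) -> float:
    """Ask the judge model whether the response meets each rubric item,
    then return the fraction of criteria that passed."""
    prompt = (
        "Grade the agent response below against each criterion.\n"
        f"Response:\n{response}\n\n"
        f"Criteria: {json.dumps(rubric)}\n"
        "Reply with a JSON object mapping each criterion to true or false."
    )
    verdict = json.loads(call_judge_model(prompt))
    passed = sum(1 for criterion in rubric if verdict.get(criterion))
    return passed / len(rubric)

score = judge_response(
    "Two critical alerts found; known scanner noise was filtered out.",
    [
        "Correctly identifies critical alerts",
        "Filters known false positives",
        "Provides severity ranking",
    ],
)
print(f"rubric score: {score:.2f}")  # 2 of 3 criteria pass -> 0.67
```

A production judge typically also returns a short justification per criterion, which helps when auditing why an eval case failed.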
Experiments
Compare different configurations side-by-side:
```bash
amodal experiment
```

Experiments let you test:
- Different LLM providers or models
- Different skill configurations
- Different prompt variations
- Different knowledge documents
Results include cost comparison, quality scores, and latency metrics.
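Conceptually, an experiment runs the same eval cases once per configuration and collects the metrics side by side. The sketch below is a hypothetical illustration of that loop, not amodal's internals; `run_eval_suite` and the configuration fields are invented for the example.

```python
# Hypothetical configurations to compare; amodal's real experiment
# definition format is not shown here.
configs = [{"provider": "anthropic"}, {"provider": "openai"}]

def run_eval_suite(config: dict) -> dict:
    """Stand-in for running the full eval suite under one configuration.
    Returns made-up metrics so the comparison loop runs end to end."""
    return {"quality": 0.82, "cost_usd": 0.11, "latency_s": 3.2}

# Run every configuration against the same cases and tabulate the results
# side by side: quality score, cost, and latency per configuration.
results = [{**config, **run_eval_suite(config)} for config in configs]
for row in results:
    print(row)
```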
Multi-Model Comparison
Run the same eval suite against multiple providers to find the best model for your use case:
```bash
amodal eval --providers anthropic,openai,google
```

Platform Integration
Eval results can be sent to the platform API for tracking trends, baselines, and comparisons over time.
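The exact request format is not documented here; purely as a hypothetical sketch, pushing one result record to a tracking endpoint could look like the following (the URL and payload fields are placeholders).

```python
import json
import urllib.request

# Hypothetical result record and endpoint; the real platform API schema
# and URL may differ.
payload = {
    "suite": "triage-accuracy",
    "provider": "anthropic",
    "quality": 0.83,
    "cost_usd": 0.12,
    "latency_s": 3.4,
}

req = urllib.request.Request(
    "https://platform.example.com/api/eval-results",  # placeholder URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # expect a 2xx status on success
```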