
# amodal eval

Run evaluation suites against your agent to measure quality, compare models, and track regressions.

```shell
amodal eval
```

## Eval Files

Evals live in `.amodal/evals/` as YAML files:

```yaml
name: triage-accuracy
description: Test alert triage quality
cases:
  - input: "Review recent security alerts"
    rubric:
      - "Correctly identifies critical alerts"
      - "Filters known false positives"
      - "Provides severity ranking"
    expected_tools:
      - request
      - load_knowledge
```
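Each case's rubric lends itself to a simple per-case score. As a minimal sketch (an assumed flow, not amodal's actual scoring code), suppose the judge returns a pass/fail verdict for each rubric item and the case score is the fraction that passed:

```python
# Illustrative sketch only -- not amodal's implementation.
# Assume an LLM judge yields a pass/fail verdict per rubric item;
# the case score is the fraction of items that passed.
def score_case(rubric: list[str], verdicts: dict[str, bool]) -> float:
    passed = sum(1 for item in rubric if verdicts.get(item, False))
    return passed / len(rubric)

rubric = [
    "Correctly identifies critical alerts",
    "Filters known false positives",
    "Provides severity ranking",
]
# Hypothetical verdicts from the judge for one agent response:
verdicts = {rubric[0]: True, rubric[1]: True, rubric[2]: False}
score = score_case(rubric, verdicts)
print(round(score, 2))  # 0.67
```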

## Evaluation Methods

| Method | Description |
| --- | --- |
| LLM judge | An LLM evaluates the agent's response against the rubric |
| Tool usage | Verify that the expected tools were called |
| Cost tracking | Track token usage and cost per eval |
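The tool-usage check is the simplest to picture. A hedged sketch, assuming the harness records which tools the agent actually called (the tool names come from the eval file above; the function itself is illustrative, not amodal's API):

```python
# Illustrative only: check that every expected tool was called,
# regardless of call order or extra calls. Not amodal's internal API.
def tools_satisfied(expected: list[str], called: list[str]) -> bool:
    return set(expected).issubset(called)

expected = ["request", "load_knowledge"]          # from the eval file
called = ["load_knowledge", "request", "search"]  # recorded at runtime
print(tools_satisfied(expected, called))          # True
print(tools_satisfied(expected, ["search"]))      # False
```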

## Experiments

Compare different configurations side-by-side:

```shell
amodal experiment
```

Experiments let you test:

- Different LLM providers or models
- Different skill configurations
- Different prompt variations
- Different knowledge documents

Results include cost comparison, quality scores, and latency metrics.
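To make the cost, quality, and latency comparison concrete, here is a hedged sketch of how such results might be weighed against each other. The `Result` fields and the quality threshold are assumptions for illustration, not amodal's output format:

```python
# Illustrative only: pick the cheapest configuration that clears a
# quality bar. Field names are assumed, not amodal's result schema.
from dataclasses import dataclass

@dataclass
class Result:
    provider: str
    quality: float   # mean rubric score, 0..1
    cost_usd: float  # total cost for the eval run
    latency_s: float # mean response latency

def best(results: list[Result], min_quality: float = 0.8) -> Result:
    ok = [r for r in results if r.quality >= min_quality]
    if not ok:
        raise ValueError("no configuration meets the quality bar")
    return min(ok, key=lambda r: r.cost_usd)

runs = [
    Result("anthropic", 0.92, 0.40, 3.1),
    Result("openai",    0.88, 0.35, 2.8),
    Result("google",    0.75, 0.20, 2.5),
]
print(best(runs).provider)  # openai
```

The "google" run is cheapest but falls below the 0.8 quality bar, so the comparison favors the cheapest run that still meets it.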

## Multi-Model Comparison

Run the same eval suite against multiple providers to find the best model for your use case:

```shell
amodal eval --providers anthropic,openai,google
```

## Platform Integration

Eval results can be sent to the platform API to track trends, baselines, and comparisons over time.