Overview
- Runs your agent loop against a dataset
- Uses external LLM providers via LiteLLM
- Computes rewards using your `ground_truth` data
- Supports both batch and interactive execution
Dataset Format
Test datasets require these columns:

| Column | Required | Description |
|---|---|---|
| `system_prompt` | Yes | System prompt for the LLM |
| `user_prompt` | Yes | User message to start the conversation |
| `ground_truth` | No | Expected output for reward computation |
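The `ground_truth` column feeds reward computation. As an illustration only (the actual reward logic is whatever your project defines; `exact_match_reward` below is a hypothetical example, not part of the tool), a minimal reward function might compare the agent's final answer to `ground_truth`:

```python
def exact_match_reward(final_answer: str, ground_truth: str) -> float:
    """Hypothetical reward: 1.0 on a whitespace/case-insensitive exact match, else 0.0."""
    return float(final_answer.strip().lower() == ground_truth.strip().lower())

print(exact_match_reward("Paris", " paris "))  # prints 1.0
print(exact_match_reward("Paris", "London"))   # prints 0.0
```

Rows without `ground_truth` still run; they simply produce no reward signal.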
Supported Formats
- JSONL (`.jsonl`)
- JSON (`.json`)
- Parquet (`.parquet`)
Example Dataset
`test_data.jsonl`:
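A small file using the columns above might look like this (contents are illustrative; note the last row omits the optional `ground_truth`):

```jsonl
{"system_prompt": "You are a math tutor.", "user_prompt": "What is 7 * 8?", "ground_truth": "56"}
{"system_prompt": "You are a math tutor.", "user_prompt": "Factor x^2 - 9.", "ground_truth": "(x - 3)(x + 3)"}
{"system_prompt": "You are a math tutor.", "user_prompt": "Explain why 0.999... equals 1."}
```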
Basic Usage
Set Up API Key
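Provider credentials are read from environment variables. The variable names below follow LiteLLM's standard conventions; set only the one(s) for the provider you are testing (values shown are placeholders):

```shell
export OPENAI_API_KEY="sk-..."          # OpenAI models (no prefix)
export ANTHROPIC_API_KEY="sk-ant-..."   # anthropic/ models
export GEMINI_API_KEY="..."             # gemini/ models
export GROQ_API_KEY="..."               # groq/ models
```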
Run Tests
Batch Mode
Run all rows and get a summary:

Save Results
Test Subset
Interactive Mode
Step through agent execution for debugging:

Interactive Session Example
Supported Providers
Test Mode uses LiteLLM for provider support. Non-OpenAI providers require a prefix (e.g., `anthropic/`, `gemini/`, `groq/`). OpenAI models need no prefix.
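The prefix rule can be captured in one line. `qualify_model` below is a hypothetical helper for illustration, not part of the tool:

```python
def qualify_model(provider: str, model: str) -> str:
    """Build the LiteLLM model string: OpenAI models are bare,
    every other provider gets a '<provider>/' prefix."""
    return model if provider == "openai" else f"{provider}/{model}"

print(qualify_model("openai", "gpt-5.2"))               # gpt-5.2
print(qualify_model("anthropic", "claude-sonnet-4-5"))  # anthropic/claude-sonnet-4-5
```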
Custom OpenAI-Compatible APIs
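LiteLLM can also target any OpenAI-compatible endpoint (e.g., a local vLLM server) by using the `openai/` prefix together with an `api_base`. The sketch below only builds the request arguments; the endpoint URL and model name are placeholders for your own server:

```python
# Request arguments for an OpenAI-compatible endpoint via LiteLLM.
# Placeholders: "my-local-model" and the localhost URL are examples only.
request = {
    "model": "openai/my-local-model",        # openai/ prefix = OpenAI-compatible API
    "api_base": "http://localhost:8000/v1",  # your server's base URL
    "messages": [{"role": "user", "content": "Hello"}],
}
# To actually send it:  import litellm; litellm.completion(**request)
print(request["model"])
```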
Tips & Best Practices
Start with Interactive Mode
Use `--interactive` first to understand agent behavior before running batch tests.
Test Edge Cases
Include diverse test cases: simple queries, multi-step problems, edge cases, and potential failure modes.
Compare Models
Test with multiple models to ensure the agent works across providers and to estimate training costs from token usage.
Troubleshooting
If you see "API key not found", set the appropriate environment variable for your provider. Remember that OpenAI models take no prefix (`gpt-5.2`), while other providers require one (`anthropic/claude-sonnet-4-5`, `gemini/gemini-3-flash-preview`).
If you see "No rows to test", verify your dataset has valid JSON on each line (for JSONL), that required columns exist, and that `--offset` is not skipping all rows.
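A small validation script can catch the first two causes before you run. `check_jsonl` is a hypothetical helper for illustration, not part of the CLI:

```python
import json

REQUIRED = {"system_prompt", "user_prompt"}

def check_jsonl(path):
    """Return a list of (line_number, problem) for rows that would be skipped."""
    problems = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # blank lines are simply ignored
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"invalid JSON: {e}"))
                continue
            missing = REQUIRED - row.keys()
            if missing:
                problems.append((i, f"missing columns: {sorted(missing)}"))
    return problems

# Demo on a file with one bad row:
import tempfile, os
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"system_prompt": "You are helpful.", "user_prompt": "Hi"}\n')
    f.write('{"user_prompt": "missing system prompt"}\n')
    path = f.name
print(check_jsonl(path))  # reports line 2 as missing 'system_prompt'
os.remove(path)
```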
Enable debug mode for detailed output on tool errors:
Next Steps
Eval Mode
Evaluate trained models with pass@k metrics
Agent Loop Guide
Learn advanced agent patterns
CLI Reference
Complete CLI documentation