## Overview
- Runs your agent loop against a dataset
- Uses external LLM providers via LiteLLM
- Computes rewards using your `ground_truth` data
- Supports both batch and interactive execution
## Dataset Format

Test datasets require these columns:

| Column | Required | Description |
|---|---|---|
| `system_prompt` | Yes | System prompt for the LLM |
| `user_prompt` | Yes | User message to start the conversation |
| `ground_truth` | No | Expected output for reward computation |
### Supported Formats

- JSONL (`.jsonl`)
- JSON (`.json`)
- Parquet (`.parquet`)
### Example Dataset

`test_data.jsonl`:
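A minimal two-row illustration matching the columns above (the prompts and ground-truth values are invented placeholders):

```jsonl
{"system_prompt": "You are a helpful math assistant.", "user_prompt": "What is 12 * 8?", "ground_truth": "96"}
{"system_prompt": "You are a helpful math assistant.", "user_prompt": "What is 15% of 200?", "ground_truth": "30"}
```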
## Basic Usage
### Set Up API Key
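For example, for the default OpenAI models (other providers' variables are listed under Supported Providers below):

```bash
export OPENAI_API_KEY="sk-..."
```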
### Run Tests
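A minimal sketch, assuming the CLI entry point is `agent-test` (a hypothetical name — substitute your project's actual command) and the module path from the CLI Options example below:

```bash
# Hypothetical entry point; the flags are documented under CLI Options
agent-test -m server:agent_loop -d test_data.jsonl --model gpt-4o
```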
## CLI Options
### Required

| Option | Description |
|---|---|
| `-m, --module` | Module path to the agent loop (e.g., `server:agent_loop`) |
| `-d, --dataset` | Path to the dataset file |
### Model Selection

| Option | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | Model name in LiteLLM format |
| `--api-key` | env var | API key for the provider |
| `--base-url` | | Custom API base URL |
### Execution Control

| Option | Default | Description |
|---|---|---|
| `--max-turns` | 10 | Maximum agent turns per row |
| `--temperature` | | LLM sampling temperature |
| `--max-tokens` | | Maximum tokens per completion |
| `--limit` | all | Maximum rows to test |
| `--offset` | 0 | Rows to skip |
### Output

| Option | Description |
|---|---|
| `-o, --output` | Save results to a JSON file |
| `-q, --quiet` | Suppress progress output |
| `--debug` | Enable debug output |
### Interactive Mode

| Option | Description |
|---|---|
| `-i, --interactive` | Enable step-by-step execution |
| `--row` | Start at a specific row index |
## Batch Mode
Run all rows and get a summary:
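A sketch using the same hypothetical `agent-test` entry point as in Run Tests (substitute your project's actual command):

```bash
# Run every row in the dataset and print a summary
agent-test -m server:agent_loop -d test_data.jsonl
```

### Save Results

Pass `-o` to write the results to a JSON file:

```bash
# Same run, but persist per-row results for later analysis
agent-test -m server:agent_loop -d test_data.jsonl -o results.json
```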
### Test a Subset
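Use `--limit` and `--offset` together (same hypothetical entry point):

```bash
# Skip the first 10 rows, then test the next 5
agent-test -m server:agent_loop -d test_data.jsonl --offset 10 --limit 5
```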
## Interactive Mode
Step through agent execution for debugging:
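A sketch with the same hypothetical `agent-test` entry point:

```bash
# Enable step-by-step execution with -i / --interactive
agent-test -m server:agent_loop -d test_data.jsonl -i
```

### Commands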
| Command | Description |
|---|---|
| `n` | Execute the next LLM call |
| `c` | Continue to completion |
| `m` | Show the current message history |
| `t` | Show available tools |
| `q` | Quit the session |
### Start at a Specific Row
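Combine `--row` with `-i` (same hypothetical entry point):

```bash
# Jump straight to row 5 in interactive mode
agent-test -m server:agent_loop -d test_data.jsonl -i --row 5
```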
### Interactive Session Example
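The transcript below is illustrative only, constructed from the documented commands; the actual display format and output will differ:

```text
[row 0] system: You are a helpful math assistant.
[row 0] user: What is 12 * 8?
> n    # execute the next LLM call
> m    # show the current message history
> c    # continue to completion
> q    # quit the session
```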
## Supported Providers

Test Mode uses LiteLLM for provider support:

| Provider | Model Format | API Key Env |
|---|---|---|
| OpenAI | `gpt-4o`, `gpt-4o-mini` | `OPENAI_API_KEY` |
| Anthropic | `anthropic/claude-sonnet-4-20250514` | `ANTHROPIC_API_KEY` |
| Google | `gemini/gemini-2.0-flash` | `GOOGLE_API_KEY` |
| Groq | `groq/llama-3.3-70b-versatile` | `GROQ_API_KEY` |
| Together | `together/meta-llama/Llama-3-70b` | `TOGETHER_API_KEY` |
### Custom OpenAI-Compatible APIs
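Point `--base-url` at any OpenAI-compatible server (e.g., a local vLLM or Ollama instance). A sketch with the same hypothetical entry point; the URL, model name, and key are placeholders. LiteLLM routes `openai/`-prefixed model names through its OpenAI-compatible client:

```bash
agent-test -m server:agent_loop -d test_data.jsonl \
  --model openai/my-local-model \
  --base-url http://localhost:8000/v1 \
  --api-key dummy-key
```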
Tips & Best Practices
### Start with Interactive Mode

Use `--interactive` first to understand agent behavior before running batch tests.

### Test Edge Cases

Include diverse test cases: simple queries, multi-step problems, edge cases, and potential failure modes.

### Monitor Token Usage

Check token counts in results to estimate training costs.

### Version Your Datasets

Keep test datasets versioned alongside your agent code.

### Compare Models

Test with multiple models to ensure the agent works across providers.
## Troubleshooting
### "API key not found"

Set the API key environment variable for your provider (see Supported Providers above), or pass the key explicitly with `--api-key`.
"Model not found”
Check the model name format:- OpenAI:
gpt-4o(no prefix needed) - Anthropic:
anthropic/claude-sonnet-4-20250514 - Google:
gemini/gemini-2.0-flash
### "No rows to test"

Check your dataset file:

- Ensure it has valid JSON on each line (for JSONL)
- Verify required columns exist
- Check that `--offset` isn't skipping all rows