Before connecting to the Osmosis training cluster, you can run your agent end-to-end on your own machine using any cloud LLM provider (OpenAI, Anthropic, etc.) to verify behavior and debug issues.

Overview

osmosis test -m <module:agent> -d <dataset> [options]
Test Mode:
  • Runs your agent loop against a dataset
  • Uses external LLM providers via LiteLLM
  • Computes rewards using your ground_truth data
  • Supports both batch and interactive execution

Dataset Format

Test datasets require these columns:
Column         Required  Description
-------------  --------  ----------------------------------------
system_prompt  Yes       System prompt for the LLM
user_prompt    Yes       User message that starts the conversation
ground_truth   No        Expected output for reward computation

Supported Formats

  • JSONL (.jsonl)
  • JSON (.json)
  • Parquet (.parquet)
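All three formats carry the same rows, so converting between them is straightforward. As a stdlib-only sketch (not part of the CLI), this is how a JSON-array dataset maps to JSONL: one JSON object per line, no enclosing array.

```python
import json

# The same records as a .json file would hold (a JSON array).
records = [
    {"system_prompt": "You are a calculator.", "user_prompt": "What is 15 * 7?", "ground_truth": "105"},
    {"system_prompt": "You are a calculator.", "user_prompt": "What is 8 + 13?", "ground_truth": "21"},
]

# JSONL serialization: one compact JSON object per line.
jsonl = "\n".join(json.dumps(r) for r in records)

# Parsing JSONL back line by line recovers the identical records.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```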

Example Dataset

test_data.jsonl:
{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 15 * 7?", "ground_truth": "105"}
{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 100 / 4?", "ground_truth": "25"}
{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 8 + 13?", "ground_truth": "21"}

Basic Usage

Set Up API Key

# OpenAI (default)
export OPENAI_API_KEY="your-key"

# Anthropic
export ANTHROPIC_API_KEY="your-key"

Run Tests

# Default: GPT-5-mini
osmosis test -m server:agent_loop -d test_data.jsonl

# Use Claude
osmosis test -m server:agent_loop -d test_data.jsonl --model anthropic/claude-sonnet-4-5

# Use custom OpenAI-compatible API
osmosis test -m server:agent_loop -d test_data.jsonl --base-url http://localhost:8000/v1
For the complete list of CLI options, see the CLI Reference.

Batch Mode

Run every row in the dataset and print a summary:
osmosis test -m server:agent_loop -d test_data.jsonl
Output:
osmosis-rollout-test v0.2.13
Loading agent: server:agent_loop
  Agent name: calculator
Loading dataset: test_data.jsonl
  Total rows: 3
Initializing provider: openai
  Model: gpt-5-mini

Running tests...
[1/3] Row 0: OK (2.1s, 148 tokens)
[2/3] Row 1: OK (1.9s, 152 tokens)
[3/3] Row 2: OK (1.7s, 134 tokens)

Summary:
  Total: 3
  Passed: 3
  Failed: 0
  Duration: 5.7s
  Total tokens: 434

Save Results

osmosis test -m server:agent_loop -d test_data.jsonl -o results.json
Output JSON structure:
{
  "summary": {
    "total": 3,
    "passed": 3,
    "failed": 0,
    "total_duration_ms": 5700,
    "total_tokens": 434
  },
  "results": [
    {
      "row_index": 0,
      "success": true,
      "error": null,
      "duration_ms": 2100,
      "token_usage": {"total_tokens": 148},
      "reward": 1.0,
      "finish_reason": "stop"
    }
  ]
}
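The results file is plain JSON, so it can be post-processed with the stdlib. A sketch that computes the pass rate and lists failed rows, using the field names from the structure above:

```python
import json

# Results in the shape written by `osmosis test ... -o results.json`.
results_json = """
{
  "summary": {"total": 3, "passed": 3, "failed": 0,
              "total_duration_ms": 5700, "total_tokens": 434},
  "results": [
    {"row_index": 0, "success": true, "error": null, "duration_ms": 2100,
     "token_usage": {"total_tokens": 148}, "reward": 1.0, "finish_reason": "stop"}
  ]
}
"""
data = json.loads(results_json)

pass_rate = data["summary"]["passed"] / data["summary"]["total"]
failed_rows = [r["row_index"] for r in data["results"] if not r["success"]]
print(f"pass rate: {pass_rate:.0%}, failed rows: {failed_rows}")
# → pass rate: 100%, failed rows: []
```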

Test Subset

# First 10 rows
osmosis test -m server:agent_loop -d test_data.jsonl --limit 10

# Skip first 50 rows, test next 20
osmosis test -m server:agent_loop -d test_data.jsonl --offset 50 --limit 20

Interactive Mode

Step through agent execution for debugging:
osmosis test -m server:agent_loop -d test_data.jsonl --interactive

Interactive Session Example

=== Interactive Mode ===
Dataset: test_data.jsonl (3 rows)
Starting at row 0

--- Row 0 ---
System: You are a calculator...
User: What is 15 * 7?

[turn 0] > n
Calling LLM...
Assistant: I'll calculate 15 * 7 using the multiply tool.
Tool calls: multiply(a=15, b=7)

Executing tools...
Tool result: 105

[turn 1] > n
Calling LLM...
Assistant: The answer is:

#### 105

[turn 1] Rollout complete
  Finish reason: stop
  Reward: 1.0

Continue to next row? (y/n/q) >
For interactive mode commands and additional options, see the CLI Reference.

Supported Providers

Test Mode uses LiteLLM for provider support. Non-OpenAI providers require a prefix (e.g., anthropic/, gemini/, groq/). OpenAI models need no prefix.

Custom OpenAI-Compatible APIs

osmosis test -m server:agent_loop -d test_data.jsonl \
  --base-url http://localhost:8000/v1 \
  --model my-local-model

Tips & Best Practices

  • Use --interactive first to understand agent behavior before running batch tests.
  • Include diverse test cases: simple queries, multi-step problems, edge cases, and potential failure modes.
  • Test with multiple models to ensure your agent works across providers and to estimate training costs from token usage.

Troubleshooting

Set the appropriate environment variable if you see “API key not found”:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
If you see “Model not found”, check the model name format: OpenAI models need no prefix (gpt-5.2), while other providers require one (anthropic/claude-sonnet-4-5, gemini/gemini-3-flash-preview).

If you see “No rows to test”, verify that your dataset has valid JSON on each line (for JSONL), that the required columns exist, and that --offset is not skipping all rows.

Enable debug mode for detailed output on tool errors:
osmosis test -m server:agent_loop -d test_data.jsonl --debug
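For the “No rows to test” case, a quick stdlib check along these lines can pinpoint a bad line or missing column before re-running. This checker is our own sketch, not part of the CLI; the required column names come from the dataset table above.

```python
import json

REQUIRED = {"system_prompt", "user_prompt"}

def check_jsonl(text: str) -> list[str]:
    """Return a list of problems found in a JSONL dataset; empty means OK."""
    problems = []
    for i, line in enumerate(text.splitlines()):
        if not line.strip():
            continue  # blank lines are harmless, skip them
        try:
            row = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append(f"line {i}: invalid JSON ({e.msg})")
            continue
        missing = REQUIRED - row.keys()
        if missing:
            problems.append(f"line {i}: missing columns {sorted(missing)}")
    return problems

good = '{"system_prompt": "s", "user_prompt": "u", "ground_truth": "g"}'
bad = '{"system_prompt": "s"}'
assert check_jsonl(good) == []
assert check_jsonl(bad) == ["line 0: missing columns ['user_prompt']"]
```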

Next Steps

Eval Mode

Evaluate trained models with pass@k metrics

Agent Loop Guide

Learn advanced agent patterns

CLI Reference

Complete CLI documentation