Test Mode allows you to validate your agent implementation locally using cloud LLM providers (OpenAI, Anthropic, etc.) before deploying to the Osmosis training infrastructure.

Overview

osmosis test -m <module:agent> -d <dataset> [options]
Test Mode:
  • Runs your agent loop against a dataset
  • Uses external LLM providers via LiteLLM
  • Computes rewards using your ground_truth data
  • Supports both batch and interactive execution

Dataset Format

Test datasets require these columns:
  Column         Required  Description
  system_prompt  Yes       System prompt for the LLM
  user_prompt    Yes       User message that starts the conversation
  ground_truth   No        Expected output for reward computation

Supported Formats

  • JSONL (.jsonl)
  • JSON (.json)
  • Parquet (.parquet)

Example Dataset

test_data.jsonl:
{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 15 * 7?", "ground_truth": "105"}
{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 100 / 4?", "ground_truth": "25"}
{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 8 + 13?", "ground_truth": "21"}
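Datasets in the schema above can also be generated programmatically. A minimal sketch in Python; the filename, prompts, and answers mirror the example dataset, and `ground_truth` is included because the reward computation uses it:

```python
import json

# Shared system prompt, taken from the example dataset above.
SYSTEM = ("You are a calculator. Use tools to solve math problems. "
          "Format final answer as: #### <number>")

# (question, expected answer) pairs; ground_truth is optional per the schema.
cases = [
    ("What is 15 * 7?", "105"),
    ("What is 100 / 4?", "25"),
    ("What is 8 + 13?", "21"),
]

with open("test_data.jsonl", "w") as f:
    for user_prompt, answer in cases:
        row = {"system_prompt": SYSTEM,
               "user_prompt": user_prompt,
               "ground_truth": answer}
        f.write(json.dumps(row) + "\n")  # one JSON object per line (JSONL)
```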

Basic Usage

Set Up API Key

# OpenAI (default)
export OPENAI_API_KEY="your-key"

# Anthropic
export ANTHROPIC_API_KEY="your-key"

Run Tests

# Default: GPT-4o
osmosis test -m server:agent_loop -d test_data.jsonl

# Use Claude
osmosis test -m server:agent_loop -d test_data.jsonl --model anthropic/claude-sonnet-4-20250514

# Use custom OpenAI-compatible API
osmosis test -m server:agent_loop -d test_data.jsonl --base-url http://localhost:8000/v1

CLI Options

Required

  Option         Description
  -m, --module   Module path to the agent loop (e.g., server:agent_loop)
  -d, --dataset  Path to the dataset file

Model Selection

  Option      Default         Description
  --model     gpt-4o          Model name in LiteLLM format
  --api-key   (from env var)  API key for the provider
  --base-url  (unset)         Custom API base URL

Execution Control

  Option         Default  Description
  --max-turns    10       Maximum agent turns per row
  --temperature  (unset)  LLM sampling temperature
  --max-tokens   (unset)  Maximum tokens per completion
  --limit        all      Maximum number of rows to test
  --offset       0        Number of rows to skip

Output

  Option        Description
  -o, --output  Save results to a JSON file
  -q, --quiet   Suppress progress output
  --debug       Enable debug output

Interactive Mode

  Option             Description
  -i, --interactive  Enable step-by-step execution
  --row              Start at a specific row index

Batch Mode

Run all rows and get summary:
osmosis test -m server:agent_loop -d test_data.jsonl
Output:
osmosis-rollout-test v0.2.8
Loading agent: server:agent_loop
  Agent name: calculator
Loading dataset: test_data.jsonl
  Total rows: 3
Initializing provider: openai
  Model: gpt-4o

Running tests...
[1/3] Row 0: OK (2.1s, 148 tokens)
[2/3] Row 1: OK (1.9s, 152 tokens)
[3/3] Row 2: OK (1.7s, 134 tokens)

Summary:
  Total: 3
  Passed: 3
  Failed: 0
  Duration: 5.7s
  Total tokens: 434

Save Results

osmosis test -m server:agent_loop -d test_data.jsonl -o results.json
Output JSON structure:
{
  "summary": {
    "total": 3,
    "passed": 3,
    "failed": 0,
    "total_duration_ms": 5700,
    "total_tokens": 434
  },
  "results": [
    {
      "row_index": 0,
      "success": true,
      "error": null,
      "duration_ms": 2100,
      "token_usage": {"total_tokens": 148},
      "reward": 1.0,
      "finish_reason": "stop"
    }
  ]
}
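A saved results file can be post-processed without re-running tests. A sketch that summarizes a file of the shape shown above; the function name and the failure-listing logic are illustrative, but the keys match the documented structure:

```python
import json

def summarize(path):
    """Return (pass_rate, total_tokens, failed_row_indices) from a results file."""
    with open(path) as f:
        data = json.load(f)
    summary = data["summary"]
    pass_rate = summary["passed"] / summary["total"] if summary["total"] else 0.0
    # Collect row indices of failed runs for quick triage.
    failed = [r["row_index"] for r in data["results"] if not r["success"]]
    return pass_rate, summary["total_tokens"], failed
```

This is handy when comparing runs across models: save each run with a distinct `-o` path and diff the summaries.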

Test Subset

# First 10 rows
osmosis test -m server:agent_loop -d test_data.jsonl --limit 10

# Skip first 50 rows, test next 20
osmosis test -m server:agent_loop -d test_data.jsonl --offset 50 --limit 20

Interactive Mode

Step through agent execution for debugging:
osmosis test -m server:agent_loop -d test_data.jsonl --interactive

Commands

  Command  Description
  n        Execute the next LLM call
  c        Continue to completion
  m        Show the current message history
  t        Show the available tools
  q        Quit the session

Start at Specific Row

# Start at row 5
osmosis test -m server:agent_loop -d test_data.jsonl --interactive --row 5

Interactive Session Example

=== Interactive Mode ===
Dataset: test_data.jsonl (3 rows)
Starting at row 0

--- Row 0 ---
System: You are a calculator...
User: What is 15 * 7?

[turn 0] > n
Calling LLM...
Assistant: I'll calculate 15 * 7 using the multiply tool.
Tool calls: multiply(a=15, b=7)

Executing tools...
Tool result: 105

[turn 1] > n
Calling LLM...
Assistant: The answer is:

#### 105

[turn 1] Rollout complete
  Finish reason: stop
  Reward: 1.0

Continue to next row? (y/n/q) >

Supported Providers

Test Mode uses LiteLLM for provider support:
  Provider   Model Format                        API Key Env
  OpenAI     gpt-4o, gpt-4o-mini                 OPENAI_API_KEY
  Anthropic  anthropic/claude-sonnet-4-20250514  ANTHROPIC_API_KEY
  Google     gemini/gemini-2.0-flash             GOOGLE_API_KEY
  Groq       groq/llama-3.3-70b-versatile        GROQ_API_KEY
  Together   together/meta-llama/Llama-3-70b     TOGETHER_API_KEY

Custom OpenAI-Compatible APIs

osmosis test -m server:agent_loop -d test_data.jsonl \
  --base-url http://localhost:8000/v1 \
  --model my-local-model

Tips & Best Practices

  • Use --interactive first to understand agent behavior before running batch tests.
  • Include diverse test cases: simple queries, multi-step problems, edge cases, and potential failure modes.
  • Check token counts in the results to estimate training costs.
  • Keep test datasets versioned alongside your agent code.
  • Test with multiple models to confirm the agent works across providers.
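Token counts from a saved results file can be turned into a rough cost estimate. A sketch; the per-token rate below is a placeholder, not any provider's real pricing, so substitute your provider's actual rates:

```python
import json

# Hypothetical flat rate in USD per 1K tokens; replace with real pricing.
USD_PER_1K_TOKENS = 0.005

def estimate_cost(path, rate_per_1k=USD_PER_1K_TOKENS):
    """Rough spend estimate from the total_tokens field of a results file."""
    with open(path) as f:
        total_tokens = json.load(f)["summary"]["total_tokens"]
    return total_tokens / 1000 * rate_per_1k
```

Note this treats prompt and completion tokens at one blended rate; most providers price them separately, so refine accordingly.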

Troubleshooting

"API key not found"

# Set the appropriate environment variable
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

"Model not found"

Check the model name format:
  • OpenAI: gpt-4o (no prefix needed)
  • Anthropic: anthropic/claude-sonnet-4-20250514
  • Google: gemini/gemini-2.0-flash

"No rows to test"

Check your dataset file:
  • Ensure it has valid JSON on each line (for JSONL)
  • Verify required columns exist
  • Check --offset isn't skipping all rows

Tool Errors

Enable debug mode for detailed output:
osmosis test -m server:agent_loop -d test_data.jsonl --debug

Next Steps