After training, use eval mode to benchmark your models: run your RolloutAgentLoop against a dataset, score its outputs with custom eval functions, and get statistical summaries and pass@k metrics.

Overview

osmosis eval -m <module:agent> -d <dataset> --model <model> --eval-fn <module:fn> [options]
Eval Mode:
  • Benchmarks trained models served at OpenAI-compatible endpoints
  • Scores agent outputs with custom eval functions
  • Supports pass@k analysis with multiple runs per row
  • Concurrent execution with --batch-size for faster benchmarks
  • Reuses existing @osmosis_reward functions as eval functions
osmosis eval is for evaluating agent performance with custom eval functions. For rubric-based evaluation of JSONL conversations, use osmosis eval-rubric instead.

Quick Start

The examples below use the osmosis-remote-rollout-example repository. Clone it and run the commands directly:
git clone https://github.com/Osmosis-AI/osmosis-remote-rollout-example.git
cd osmosis-remote-rollout-example && uv sync

Benchmark a Trained Model

Connect to your model serving endpoint and evaluate:
# Benchmark a trained model served at an endpoint
# Replace <your-model> with the model name registered in your serving endpoint
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward \
    --model <your-model> \
    --base-url http://localhost:8000/v1

# Multiple eval functions
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:exact_match \
    --eval-fn rewards:partial_match \
    --model <your-model> \
    --base-url http://localhost:8000/v1

Comparison Baselines with LiteLLM

Benchmark against external LLM providers using LiteLLM format. These commands can be run directly from the example repository with only an API key:
export OPENAI_API_KEY="your-key"

# Compare against GPT-5-mini as a baseline
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --model openai/gpt-5-mini

# Compare against Claude
export ANTHROPIC_API_KEY="your-key"
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --model anthropic/claude-sonnet-4-5

Dataset Format

Eval mode uses the same dataset format as Test Mode:
Column          Required  Description
system_prompt   Yes       System prompt for the LLM
user_prompt     Yes       User message to start the conversation
ground_truth    No        Expected output for eval function scoring
Additional columns are passed to eval functions via metadata (full mode) or extra_info (simple mode). Supported formats: .jsonl, .json, .parquet
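
For example, a test_data.jsonl row could look like the following (illustrative values; the extra difficulty column would be forwarded to eval functions as described above):

{"system_prompt": "You are a helpful math tutor.", "user_prompt": "What is 12 * 8?", "ground_truth": "96", "difficulty": "easy"}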

Eval Functions

Eval functions score agent outputs. Two signatures are supported, auto-detected by the first parameter name.

Simple Mode (compatible with @osmosis_reward)

Use when you only need the final assistant response:
def exact_match(solution_str: str, ground_truth: str, extra_info: dict = None, **kwargs) -> float:
    """Score based on the last assistant message content."""
    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0
  • First parameter must be named solution_str
  • solution_str is extracted from the last assistant message
  • Compatible with existing @osmosis_reward functions without modification
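
The Quick Start commands above also reference a rewards:partial_match function. A minimal sketch of what such a simple-mode function could look like (illustrative only, not the implementation in the example repository):

def partial_match(solution_str: str, ground_truth: str, extra_info: dict = None, **kwargs) -> float:
    """Fraction of ground-truth tokens that appear in the assistant's final response."""
    truth_tokens = set(ground_truth.lower().split())
    if not truth_tokens:
        return 0.0
    response_tokens = set(solution_str.lower().split())
    return len(truth_tokens & response_tokens) / len(truth_tokens)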

Full Mode

Use when you need the complete conversation history:
def conversation_quality(messages: list, ground_truth: str, metadata: dict, **kwargs) -> float:
    """Score based on the full conversation."""
    assistant_messages = [m for m in messages if m["role"] == "assistant"]
    return min(1.0, len(assistant_messages) / 3)
  • First parameter must be named messages
  • Receives the complete message list from the agent run
  • Both sync and async functions are supported
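
Because async functions are accepted, a full-mode eval can also await I/O such as an external judge. A minimal async sketch that only inspects the conversation locally (the function name and scoring logic are illustrative):

async def grounded_answer(messages: list, ground_truth: str, metadata: dict, **kwargs) -> float:
    """Return 1.0 if the final assistant message mentions the ground truth."""
    assistant_messages = [m for m in messages if m["role"] == "assistant"]
    if not assistant_messages:
        return 0.0
    final_content = assistant_messages[-1].get("content") or ""
    return 1.0 if ground_truth.strip().lower() in final_content.lower() else 0.0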

pass@k Analysis

When --n is greater than 1, eval mode runs each dataset row multiple times and computes pass@k metrics. pass@k estimates the probability that at least one of k randomly selected samples from n total runs passes (score >= threshold).
# pass@k with 5 runs per row
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 \
    --model <your-model> --base-url http://localhost:8000/v1

# Custom pass threshold (default is 1.0)
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 --pass-threshold 0.5 \
    --model <your-model> --base-url http://localhost:8000/v1
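
The description above matches the standard unbiased pass@k estimator. For reference, a sketch of that calculation (an assumption about how the metric is computed, not code taken from osmosis itself), where c is the number of the n runs that meet the pass threshold:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n runs is a passing run, given c passing runs."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 runs per row, 3 of which passed the threshold
print(pass_at_k(n=5, c=3, k=1))  # 0.6
print(pass_at_k(n=5, c=3, k=2))  # 0.9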

Concurrent Execution

Use --batch-size to run multiple requests in parallel:
# Run 5 concurrent requests
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward \
    --model <your-model> --base-url http://localhost:8000/v1 \
    --batch-size 5

# Combine with pass@k — 10 runs per row, 5 concurrent
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 10 --batch-size 5 \
    --model <your-model> --base-url http://localhost:8000/v1

Model Endpoints

Eval mode works with any OpenAI-compatible serving endpoint via --base-url:
Serving Platform           Example --base-url
vLLM                       http://localhost:8000/v1
SGLang                     http://localhost:30000/v1
Ollama                     http://localhost:11434/v1
Any OpenAI-compatible API  http://<host>:<port>/v1
The --model parameter should match the model name as registered in the serving endpoint.

Output Format

Save results with -o:
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 \
    --model <your-model> --base-url http://localhost:8000/v1 \
    -o results.json
{
  "config": {
    "model": "<your-model>",
    "n_runs": 5,
    "pass_threshold": 1.0,
    "eval_fns": ["rewards:compute_reward"]
  },
  "summary": {
    "total_rows": 100,
    "total_runs": 500,
    "eval_fns": {
      "rewards:compute_reward": {
        "mean": 0.72,
        "std": 0.45,
        "min": 0.0,
        "max": 1.0,
        "pass_at_1": 0.72,
        "pass_at_3": 0.94,
        "pass_at_5": 0.98
      }
    },
    "total_tokens": 625000,
    "total_duration_ms": 230500
  },
  "rows": [
    {
      "row_index": 0,
      "runs": [
        {
          "run_index": 0,
          "success": true,
          "scores": {"rewards:compute_reward": 1.0},
          "duration_ms": 450,
          "tokens": 200
        }
      ]
    }
  ]
}
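
The saved file is plain JSON, so it is easy to post-process. A small sketch that prints the per-eval-function summary, assuming the layout shown above:

import json

with open("results.json") as f:
    results = json.load(f)

for fn_name, stats in results["summary"]["eval_fns"].items():
    print(f"{fn_name}: mean={stats['mean']:.2f}, std={stats['std']:.2f}")
    for key, value in stats.items():
        if key.startswith("pass_at_"):
            print(f"  {key} = {value:.2f}")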
For the complete list of CLI options, see the CLI Reference.

Tips & Best Practices

  • Validate that your eval functions work before running expensive pass@k analyses.
  • Existing @osmosis_reward functions work as eval functions in simple mode without modification.
  • Pass multiple --eval-fn flags to evaluate different quality dimensions (correctness, efficiency, format compliance) in a single run.
  • Run the same benchmark against LiteLLM providers (e.g., --model openai/gpt-5-mini) to establish baseline performance.
  • Concurrent execution can significantly reduce wall-clock time. Start with a moderate --batch-size (e.g., 5) and increase based on endpoint capacity.

Next Steps