After training, use eval mode to benchmark your models: run your RolloutAgentLoop against a dataset, score its outputs with custom eval functions, and get statistical summaries and pass@k metrics.

Overview

osmosis eval -m <module:agent> -d <dataset> --model <model> --eval-fn <module:fn> [options]
Eval Mode:
  • Benchmarks trained models served at OpenAI-compatible endpoints
  • Scores agent outputs with custom eval functions
  • Supports pass@k analysis with multiple runs per row
  • Concurrent execution with --batch-size for faster benchmarks
  • Reuses existing @osmosis_reward functions as eval functions
osmosis eval is for evaluating agent performance with custom eval functions. For rubric-based evaluation of JSONL conversations, use osmosis eval-rubric instead.

Quick Start

The examples below use the osmosis-remote-rollout-example repository. Clone it and run the commands directly:
git clone https://github.com/Osmosis-AI/osmosis-remote-rollout-example.git
cd osmosis-remote-rollout-example && uv sync

Benchmark a Trained Model

Connect to your model serving endpoint and evaluate:
# Benchmark a trained model served at an endpoint
# Replace <your-model> with the model name registered in your serving endpoint
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward \
    --model <your-model> \
    --base-url http://localhost:8000/v1

# Multiple eval functions
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:exact_match \
    --eval-fn rewards:partial_match \
    --model <your-model> \
    --base-url http://localhost:8000/v1

Comparison Baselines with LiteLLM

Benchmark against external LLM providers using LiteLLM format. These commands can be run directly from the example repository with only an API key:
export OPENAI_API_KEY="your-key"

# Compare against GPT-5-mini as a baseline
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --model openai/gpt-5-mini

# Compare against Claude
export ANTHROPIC_API_KEY="your-key"
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --model anthropic/claude-sonnet-4-5

Dataset Format

Eval mode uses the same dataset format as Test Mode:
Column          Required  Description
system_prompt   Yes       System prompt for the LLM
user_prompt     Yes       User message to start the conversation
ground_truth    No        Expected output for eval function scoring
Additional columns are passed to eval functions via metadata (full mode) or extra_info (simple mode). Supported formats: .jsonl, .json, .parquet
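
For example, a test_data.jsonl row could look like the following (illustrative values; the extra difficulty column would be forwarded to eval functions as described above):

{"system_prompt": "You are a helpful math tutor.", "user_prompt": "What is 12 * 8?", "ground_truth": "96", "difficulty": "easy"}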

Eval Functions

Eval functions score agent outputs. Two signatures are supported, auto-detected by the first parameter name.

Simple Mode (compatible with @osmosis_reward)

Use when you only need the final assistant response:
def exact_match(solution_str: str, ground_truth: str, extra_info: dict = None, **kwargs) -> float:
    """Score based on the last assistant message content."""
    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0
  • First parameter must be named solution_str
  • solution_str is extracted from the last assistant message
  • Compatible with existing @osmosis_reward functions without modification
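
The Quick Start commands above also reference a rewards:partial_match function. A minimal sketch of what such a simple-mode function could look like (illustrative only, not the implementation in the example repository):

def partial_match(solution_str: str, ground_truth: str, extra_info: dict = None, **kwargs) -> float:
    """Fraction of ground-truth tokens that appear in the assistant's final response."""
    truth_tokens = set(ground_truth.lower().split())
    if not truth_tokens:
        return 0.0
    response_tokens = set(solution_str.lower().split())
    return len(truth_tokens & response_tokens) / len(truth_tokens)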

Full Mode

Use when you need the complete conversation history:
def conversation_quality(messages: list, ground_truth: str, metadata: dict, **kwargs) -> float:
    """Score based on the full conversation."""
    assistant_messages = [m for m in messages if m["role"] == "assistant"]
    return min(1.0, len(assistant_messages) / 3)
  • First parameter must be named messages
  • Receives the complete message list from the agent run
  • Both sync and async functions are supported
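
Because async functions are accepted, a full-mode eval can also await I/O such as an external judge. A minimal async sketch that only inspects the conversation locally (the function name and scoring logic are illustrative):

async def grounded_answer(messages: list, ground_truth: str, metadata: dict, **kwargs) -> float:
    """Return 1.0 if the final assistant message mentions the ground truth."""
    assistant_messages = [m for m in messages if m["role"] == "assistant"]
    if not assistant_messages:
        return 0.0
    final_content = assistant_messages[-1].get("content") or ""
    return 1.0 if ground_truth.strip().lower() in final_content.lower() else 0.0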

pass@k Analysis

When --n is greater than 1, eval mode runs each dataset row multiple times and computes pass@k metrics. pass@k estimates the probability that at least one of k randomly selected samples from n total runs passes (score >= threshold).
# pass@k with 5 runs per row
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 \
    --model <your-model> --base-url http://localhost:8000/v1

# Custom pass threshold (default is 1.0)
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 --pass-threshold 0.5 \
    --model <your-model> --base-url http://localhost:8000/v1
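
The description above matches the standard unbiased pass@k estimator. For reference, a sketch of that calculation (an assumption about how the metric is computed, not code taken from osmosis itself), where c is the number of the n runs that meet the pass threshold:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n runs is a passing run, given c passing runs."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 runs per row, 3 of which passed the threshold
print(pass_at_k(n=5, c=3, k=1))  # 0.6
print(pass_at_k(n=5, c=3, k=2))  # 0.9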

Concurrent Execution

Use --batch-size to run multiple requests in parallel:
# Run 5 concurrent requests
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward \
    --model <your-model> --base-url http://localhost:8000/v1 \
    --batch-size 5

# Combine with pass@k — 10 runs per row, 5 concurrent
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 10 --batch-size 5 \
    --model <your-model> --base-url http://localhost:8000/v1

Model Endpoints

Eval mode works with any OpenAI-compatible serving endpoint via --base-url:
Serving Platform           Example --base-url
vLLM                       http://localhost:8000/v1
SGLang                     http://localhost:30000/v1
Ollama                     http://localhost:11434/v1
Any OpenAI-compatible API  http://<host>:<port>/v1
The --model parameter should match the model name as registered in the serving endpoint.

Output Format

Save results with -o:
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 \
    --model <your-model> --base-url http://localhost:8000/v1 \
    -o results.json
{
  "config": {
    "model": "<your-model>",
    "n_runs": 5,
    "pass_threshold": 1.0,
    "eval_fns": ["rewards:compute_reward"]
  },
  "summary": {
    "total_rows": 100,
    "total_runs": 500,
    "eval_fns": {
      "rewards:compute_reward": {
        "mean": 0.72,
        "std": 0.45,
        "min": 0.0,
        "max": 1.0,
        "pass_at_1": 0.72,
        "pass_at_3": 0.94,
        "pass_at_5": 0.98
      }
    },
    "total_tokens": 625000,
    "total_duration_ms": 230500
  },
  "rows": [
    {
      "row_index": 0,
      "runs": [
        {
          "run_index": 0,
          "success": true,
          "scores": {"rewards:compute_reward": 1.0},
          "duration_ms": 450,
          "tokens": 200
        }
      ]
    }
  ]
}
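
The saved file is plain JSON, so it is easy to post-process. A small sketch that prints the per-eval-function summary, assuming the layout shown above:

import json

with open("results.json") as f:
    results = json.load(f)

for fn_name, stats in results["summary"]["eval_fns"].items():
    print(f"{fn_name}: mean={stats['mean']:.2f}, std={stats['std']:.2f}")
    for key, value in stats.items():
        if key.startswith("pass_at_"):
            print(f"  {key} = {value:.2f}")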
For the complete list of CLI options, see the CLI Reference.

Tips & Best Practices

  • Validate that your eval functions work before running expensive pass@k analyses.
  • Existing @osmosis_reward functions work as eval functions in simple mode without modification.
  • Pass multiple --eval-fn flags to evaluate different quality dimensions (correctness, efficiency, format compliance) in a single run.
  • Run the same benchmark against LiteLLM providers (e.g., --model openai/gpt-5-mini) to establish baseline performance.
  • Concurrent execution can significantly reduce wall-clock time. Start with a moderate --batch-size (e.g., 5) and increase based on endpoint capacity.

Next Steps