Benchmark a RolloutAgentLoop against datasets with custom eval functions, statistical analysis, and pass@k metrics.
## Overview
- Benchmarks trained models served at OpenAI-compatible endpoints
- Scores agent outputs with custom eval functions
- Supports pass@k analysis with multiple runs per row
- Concurrent execution with `--batch-size` for faster benchmarks
- Reuses existing `@osmosis_reward` functions as eval functions
`osmosis eval` is for evaluating agent performance with custom eval functions. For rubric-based evaluation of JSONL conversations, use `osmosis eval-rubric` instead.

## Quick Start
### Benchmark a Trained Model
Connect to your model serving endpoint and evaluate:
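A minimal invocation might look like the sketch below. The endpoint, model name, and output path are placeholders, and the arguments that point at your agent, dataset, and eval functions are omitted because their exact flags are not shown on this page; only the flags documented here are used.

```bash
# Sketch only: model name and output path are placeholders; the arguments for
# your agent, dataset, and eval functions are omitted here.
osmosis eval \
  --base-url http://localhost:8000/v1 \
  --model my-trained-model \
  --n 1 \
  --batch-size 5 \
  -o results.json
```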
### Comparison Baselines with LiteLLM

Benchmark against external LLM providers using LiteLLM format. These commands can be run directly from the example repository with only an API key:
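As a sketch, a baseline run against OpenAI using LiteLLM-style model naming (`provider/model`) might look like this; the API key variable and output path are assumptions, and the agent, dataset, and eval-function arguments are again omitted:

```bash
# Sketch only: LiteLLM reads the provider key from the environment
# (OPENAI_API_KEY for OpenAI models).
export OPENAI_API_KEY="sk-..."
osmosis eval \
  --model openai/gpt-5-mini \
  --n 1 \
  -o baseline_results.json
```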
## Dataset Format

Eval mode uses the same dataset format as Test Mode:

| Column | Required | Description |
|---|---|---|
system_prompt | Yes | System prompt for the LLM |
user_prompt | Yes | User message to start conversation |
ground_truth | No | Expected output for eval function scoring |
Any additional columns are passed to the eval function as `metadata` (full mode) or `extra_info` (simple mode).

Supported formats: `.jsonl`, `.json`, `.parquet`
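For illustration, a single `.jsonl` row using the columns above might look like:

```jsonl
{"system_prompt": "You are a helpful math tutor.", "user_prompt": "What is 17 * 24?", "ground_truth": "408"}
```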
## Eval Functions
Eval functions score agent outputs. Two signatures are supported, auto-detected by the first parameter name.

### Simple Mode (compatible with @osmosis_reward)
Use when you only need the final assistant response (see the sketch after this list):

- First parameter must be named `solution_str`
- `solution_str` is extracted from the last assistant message
- Compatible with existing `@osmosis_reward` functions without modification
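A minimal sketch of a simple-mode eval function; the parameters after `solution_str` (here `ground_truth` and `extra_info`) are assumptions based on the dataset columns above:

```python
def exact_match(solution_str: str, ground_truth: str, extra_info=None) -> float:
    """Score 1.0 when the final assistant response matches the expected output."""
    return 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0
```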
### Full Mode
Use when you need the complete conversation history (see the sketch after this list):

- First parameter must be named `messages`
- Receives the complete message list from the agent run
- Both sync and async functions are supported
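A minimal sketch of a full-mode eval function; the parameters after `messages` are assumptions, and the async form is shown since both sync and async are supported:

```python
async def tool_use_then_answer(messages, ground_truth=None, metadata=None) -> float:
    """Reward runs where the agent called a tool and ended with a non-empty answer."""
    called_tool = any(m.get("role") == "tool" for m in messages)
    final = next((m for m in reversed(messages) if m.get("role") == "assistant"), {})
    return 1.0 if called_tool and final.get("content") else 0.0
```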
## pass@k Analysis
When `--n` is greater than 1, eval mode runs each dataset row multiple times and computes pass@k metrics.
pass@k estimates the probability that at least one of k randomly selected samples from n total runs passes (score >= threshold).
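The standard unbiased estimator for this quantity is `pass@k = 1 - C(n-c, k) / C(n, k)`, where `c` is the number of passing runs out of `n`; a small sketch, assuming eval mode uses this estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n runs passes,
    given that c of the n runs passed."""
    if n - c < k:
        return 1.0  # every draw of k runs must include a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```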
## Concurrent Execution
Use `--batch-size` to run multiple requests in parallel:
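For example, reusing the placeholder endpoint and model name from the Quick Start sketch (other required arguments omitted):

```bash
# Process up to 10 dataset rows concurrently
osmosis eval --batch-size 10 --base-url http://localhost:8000/v1 --model my-trained-model
```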
## Model Endpoints
Eval mode works with any OpenAI-compatible serving endpoint via `--base-url`:
| Serving Platform | Example --base-url |
|---|---|
| vLLM | http://localhost:8000/v1 |
| SGLang | http://localhost:30000/v1 |
| Ollama | http://localhost:11434/v1 |
| Any OpenAI-compatible API | http://<host>:<port>/v1 |
The `--model` parameter should match the model name as registered in the serving endpoint.
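As an illustration of that pairing, you might serve a model with vLLM and point eval mode at it like this; the model name and port are examples, and the other eval arguments are omitted:

```bash
# Serve the model (vLLM registers it under its Hugging Face name by default)
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000

# Point eval mode at the endpoint; --model matches the registered name
osmosis eval --base-url http://localhost:8000/v1 --model Qwen/Qwen2.5-7B-Instruct
```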
## Output Format
Save results with `-o`:
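For example (the filename is a placeholder and the exact output schema is not shown here; other required arguments are omitted):

```bash
osmosis eval --base-url http://localhost:8000/v1 --model my-trained-model -o eval_results.json
```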
## Tips & Best Practices
### Start with `--n 1`
Validate your eval functions work before running expensive pass@k analyses.
### Reuse `@osmosis_reward` Functions
Existing reward functions work as eval functions in simple mode without modification.
### Use Multiple Eval Functions
Evaluate different quality dimensions — correctness, efficiency, format compliance — in a single run.
### Compare Against Baselines
Run the same benchmark with LiteLLM providers (e.g., `--model openai/gpt-5-mini`) to establish baseline performance.

### Tune `--batch-size` for Throughput
Concurrent execution can significantly reduce wall time. Start with a moderate value (e.g., `5`) and increase based on endpoint capacity.