Evaluation Runs

An evaluation run scores your rollout’s AgentWorkflow and Grader against a platform dataset and reports an aggregate score, pass rate, and per-sample results. The platform pulls your code from the synced workspace repository and runs the evaluation on its own infrastructure — you don’t need GPUs or a training run.

Concepts

Smoke Test or Formal Evaluation

There are two reasons to run an evaluation:

As a smoke test before training. Run an evaluation first to confirm that the rollout works end-to-end and that the grader returns reasonable scores on a small slice, before you commit GPUs to a full training run. Set a small [evaluation].limit to score only the first few rows.
As a formal evaluation. Measure agent quality on its own: to compare models or prompts, track quality over time, or run evaluations from CI. This works the same for a base model or a trained checkpoint. Set [evaluation].limit to the dataset’s row count to score every row; if you omit limit, the platform scores a random 10% sample.
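
The two modes differ only in how limit is set. As a rough sketch, the Python helper below renders the [evaluation] section for either case. The function name and the row counts are hypothetical; only the limit key comes from this page.

```python
from typing import Optional


def evaluation_section(limit: Optional[int] = None) -> str:
    """Render an [evaluation] TOML section as text.

    limit=None leaves `limit` unset, so the platform default applies
    (a random 10% sample, per this page).
    """
    lines = ["[evaluation]"]
    if limit is not None:
        if limit < 1:
            raise ValueError("limit must be a positive row count")
        lines.append(f"limit = {limit}")
    return "\n".join(lines) + "\n"


# Smoke test: score only a handful of rows (5 is an arbitrary example).
smoke_test = evaluation_section(limit=5)

# Formal evaluation: pass the dataset's row count (assumed to be 1000 here).
formal = evaluation_section(limit=1000)

print(smoke_test)
print(formal)
```

Paste the rendered section into the TOML file under configs/eval/ that you submit.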

Evaluation Configuration vs Evaluation Run

An Evaluation Configuration is the recipe — it defines which model, dataset, AgentWorkflow, and evaluation settings to use. An Evaluation Run is a single execution of that configuration. You can submit multiple runs from the same configuration to compare models, prompts, or dataset slices.

Submitting an Evaluation Run

Submit an evaluation run using the CLI with a TOML configuration file under configs/eval/:

osmosis eval submit configs/eval/my-rollout.toml

Git Sync is the source of truth for your rollout code. The CLI reads config values from the local TOML file you pass, but rollout code comes from the synced workspace repository. Commit, push, and wait for sync before submitting code changes; set commit_sha when you need a specific synced revision.

Pass --yes to skip the confirmation prompt in scripts or CI:

osmosis eval submit configs/eval/my-rollout.toml --yes
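
For CI, the same command can be wrapped in a small script. The following is a minimal sketch, not an official client: the function names are hypothetical, and it assumes the osmosis CLI is on PATH and exits with a non-zero status when submission fails.

```python
import subprocess
from typing import List


def build_submit_command(config_path: str, skip_confirmation: bool = True) -> List[str]:
    """Build the argument list for `osmosis eval submit`."""
    cmd = ["osmosis", "eval", "submit", config_path]
    if skip_confirmation:
        # --yes skips the interactive confirmation prompt.
        cmd.append("--yes")
    return cmd


def submit_eval(config_path: str) -> None:
    """Run the submit command; raises CalledProcessError on a non-zero exit.

    Assumption: the CLI signals failure through its exit status.
    """
    subprocess.run(build_submit_command(config_path), check=True)


# Show the command a CI job would run, without executing it.
print(" ".join(build_submit_command("configs/eval/my-rollout.toml")))
```

Call submit_eval("configs/eval/my-rollout.toml") from the CI job only after the commit you want to evaluate has been pushed and synced.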

Key Configuration Fields

[experiment]
rollout = "my-rollout"                  # Rollout directory name (under rollouts/)
entrypoint = "main.py"                  # Entrypoint file name
model_path = "openai/gpt-5-mini"        # LiteLLM-style model name
dataset = "my-dataset"                  # Platform dataset name
# commit_sha = "abc123..."              # Optional: pin to a specific synced commit

[evaluation]
# Optional. Omit values to use platform defaults.
# limit = 200                           # First N rows; omit for random 10% sample
# n = 1                                 # Evaluation attempts per row
# batch_size = 1                        # Rows evaluated per batch
# pass_threshold = 1.0                  # Minimum passing score
# agent_workflow_timeout_s = 450        # Agent workflow timeout per row
# grader_timeout_s = 150                # Grader timeout per row

See Config Files for the full TOML reference with all available fields, including [env] and [secrets].
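
The config describes pass_threshold as the minimum passing score. One plausible reading, shown here as an assumption rather than confirmed platform behavior, is that an evaluation passes when its score is greater than or equal to the threshold, and the pass rate is the fraction of evaluations that pass. The scores below are made up.

```python
from typing import Sequence


def pass_rate(scores: Sequence[float], pass_threshold: float) -> float:
    """Fraction of scores at or above pass_threshold.

    Assumption: "minimum passing score" means score >= pass_threshold.
    """
    if not scores:
        raise ValueError("need at least one score")
    passed = sum(1 for score in scores if score >= pass_threshold)
    return passed / len(scores)


scores = [1.0, 0.5, 1.0, 0.0]  # hypothetical grader scores

print(pass_rate(scores, pass_threshold=1.0))  # 0.5
print(pass_rate(scores, pass_threshold=0.5))  # 0.75
```

Under this reading, lowering pass_threshold can only keep the pass rate the same or raise it, which is worth remembering when comparing runs that used different thresholds.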

Status Lifecycle

An evaluation run moves through these statuses:

Status	Description
pending	Run is queued and waiting for resources to be provisioned.
running	Evaluation is actively executing against the dataset.
finished	Evaluation completed successfully. Score, pass rate, and sample counts are available.
failed	Evaluation encountered an error during execution. Check logs for details.
stopped	Evaluation was manually stopped by a user via the CLI or dashboard.
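
Scripts that watch a run usually need to know whether a status can still change. A minimal sketch follows; the split into active and terminal statuses is inferred from the descriptions in the table above, and the names are hypothetical.

```python
# Statuses from the table above, grouped by whether the run can still progress.
ACTIVE_STATUSES = frozenset({"pending", "running"})
TERMINAL_STATUSES = frozenset({"finished", "failed", "stopped"})


def is_terminal(status: str) -> bool:
    """Return True if the run has reached an end state."""
    if status in TERMINAL_STATUSES:
        return True
    if status in ACTIVE_STATUSES:
        return False
    # Fail loudly instead of guessing about a status this page does not list.
    raise ValueError(f"unknown status: {status!r}")


print(is_terminal("running"))   # False
print(is_terminal("finished"))  # True
```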

Monitoring

Track evaluation progress through the CLI or the platform dashboard.

CLI Commands

# List evaluation runs for the current workspace repository
osmosis eval list
osmosis eval list --all

# Show details and results for a single run
osmosis eval info my-eval-run

The info output includes the model, dataset, rollout, and timestamps, plus the aggregate score, pass rate, and total sample count once the run finishes. While a run is pending or running, results are a live snapshot. The sidebar reports progress (rows completed and percent) and duration. Dedicated Configuration and Results sections surface the entrypoint, commit SHA, dataset stats, pass thresholds, pass@k, token limits, resolved secret scopes, [env] keys, and the most recent platform logs.

n is the number of evaluation attempts per dataset row. With limit = L and n = N, the platform runs up to L * N total evaluations. If you omit limit and the platform samples rows instead, the upper bound is sampled_rows * N.
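
The upper bound is plain multiplication. The sketch below works it through with a hypothetical function name and example numbers. It takes the number of scored rows as an input because this page does not say how the 10% sample is rounded.

```python
def max_total_evaluations(rows: int, n: int) -> int:
    """Upper bound on evaluations: scored rows times attempts per row.

    `rows` is `limit` when it is set, or the sampled row count otherwise.
    """
    if rows < 0:
        raise ValueError("rows must be non-negative")
    if n < 1:
        raise ValueError("n must be at least 1")
    return rows * n


print(max_total_evaluations(rows=200, n=1))  # 200
print(max_total_evaluations(rows=200, n=4))  # 800
```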

Platform Dashboard

The web dashboard at platform.osmosis.ai lists evaluation runs alongside training runs. You can filter by status, dataset, model, and rollout, and inspect per-run scores and samples.

Managing Runs

Stopping a Run

osmosis eval stop my-eval-run

This requests a stop for a pending or running evaluation. The run moves to stopped once the platform finishes cleanup. Pass --yes to skip the confirmation prompt.
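
Because a stop request applies only to pending or running evaluations, a script can check the status before building the command. This is a sketch under assumptions: the function name is hypothetical, and the caller is expected to have obtained the current status some other way, for example by reading the output of osmosis eval info.

```python
from typing import List

# Per this page, only pending or running evaluations can be stopped.
STOPPABLE_STATUSES = frozenset({"pending", "running"})


def build_stop_command(run_name: str, status: str, skip_confirmation: bool = False) -> List[str]:
    """Build the argument list for `osmosis eval stop`, or raise if the run has already ended."""
    if status not in STOPPABLE_STATUSES:
        raise ValueError(f"run {run_name!r} is {status!r}; nothing to stop")
    cmd = ["osmosis", "eval", "stop", run_name]
    if skip_confirmation:
        # --yes skips the interactive confirmation prompt.
        cmd.append("--yes")
    return cmd


print(" ".join(build_stop_command("my-eval-run", "running", skip_confirmation=True)))
```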

Next Steps

Config Files

Reference for the evaluation TOML config.

Datasets

Upload and validate datasets for evaluation runs.

Training Runs

Submit a training run once your evaluation results look healthy.