An evaluation run scores your rollout’s AgentWorkflow and Grader against a platform dataset and reports an aggregate score, pass rate, and per-sample results. The platform pulls your code from the synced workspace repository and runs the evaluation on its own infrastructure — you don’t need GPUs or a training run.

Concepts

Smoke Test or Formal Evaluation

There are two reasons to run one:
  • As a smoke test before training. Run an evaluation first to confirm the rollout works end-to-end and the grader returns reasonable scores on a small slice, before you commit GPUs to a full training run. Set a small [evaluation].limit to score only a few rows.
  • As a formal evaluation. Measure agent quality on its own: compare models or prompts, track quality over time, or run evaluations from CI. This works the same for a base model or a trained checkpoint. Set [evaluation].limit to the dataset’s row count to score every row; if you omit limit, the platform scores a random 10% sample.

Evaluation Configuration vs Evaluation Run

An Evaluation Configuration is the recipe — it defines which model, dataset, AgentWorkflow, and evaluation settings to use. An Evaluation Run is a single execution of that configuration. You can submit multiple runs from the same configuration to compare models, prompts, or dataset slices.

Submitting an Evaluation Run

Submit an evaluation run using the CLI with a TOML configuration file under configs/eval/:
osmosis eval submit configs/eval/my-rollout.toml
Git Sync is the source of truth for your rollout code. The CLI reads config values from the local TOML file you pass, but rollout code comes from the synced workspace repository. Commit, push, and wait for sync before submitting code changes; set commit_sha when you need a specific synced revision.
Pass --yes to skip the confirmation prompt in scripts or CI:
osmosis eval submit configs/eval/my-rollout.toml --yes
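
For CI jobs written in Python, the same command can be wrapped in a small helper. This is a minimal sketch, not part of the Osmosis tooling: the function names build_submit_command and submit are hypothetical, and it assumes the osmosis CLI is on PATH and exits with a non-zero code when submission fails.

```python
import subprocess


def build_submit_command(config_path: str, skip_prompt: bool = True) -> list[str]:
    """Build the argv for `osmosis eval submit`, adding --yes when requested."""
    command = ["osmosis", "eval", "submit", config_path]
    if skip_prompt:
        # --yes skips the confirmation prompt, which would otherwise block a CI job.
        command.append("--yes")
    return command


def submit(config_path: str) -> None:
    """Run the submit command; raises CalledProcessError on a non-zero exit.

    Assumption: the CLI signals failure through its exit code.
    """
    subprocess.run(build_submit_command(config_path), check=True)


# Show the command without running it.
print(" ".join(build_submit_command("configs/eval/my-rollout.toml")))
# prints: osmosis eval submit configs/eval/my-rollout.toml --yes
```

Keeping the command construction separate from the subprocess call makes the helper easy to unit test without the CLI installed.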

Key Configuration Fields

[experiment]
rollout = "my-rollout"                  # Rollout directory name (under rollouts/)
entrypoint = "main.py"                  # Entrypoint file name
model_path = "openai/gpt-5-mini"        # LiteLLM-style model name
dataset = "my-dataset"                  # Platform dataset name
# commit_sha = "abc123..."              # Optional: pin to a specific synced commit

[evaluation]
# Optional. Omit values to use platform defaults.
# limit = 200                           # First N rows; omit for random 10% sample
# n = 1                                 # Evaluation attempts per row
# batch_size = 1                        # Rows evaluated per batch
# pass_threshold = 1.0                  # Minimum passing score
# agent_workflow_timeout_s = 450        # Agent workflow timeout per row
# grader_timeout_s = 150                # Grader timeout per row
See Config Files for the full TOML reference with all available fields, including [env] and [secrets].
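
If you generate configs from a script, for example one smoke-test config and one full-dataset config, a sketch like the following may help. It only emits fields shown above. The helper name render_eval_config and the row count of 1000 are hypothetical, and quoting via json.dumps is an assumption that holds for ordinary names like these rather than a general TOML writer.

```python
import json
from typing import Optional


def render_eval_config(rollout: str, entrypoint: str, model_path: str,
                       dataset: str, limit: Optional[int] = None) -> str:
    """Render a minimal evaluation TOML config as text."""
    def quote(value: str) -> str:
        # json.dumps produces a double-quoted, escaped string; for plain
        # names this is also a valid TOML basic string.
        return json.dumps(value, ensure_ascii=False)

    lines = [
        "[experiment]",
        f"rollout = {quote(rollout)}",
        f"entrypoint = {quote(entrypoint)}",
        f"model_path = {quote(model_path)}",
        f"dataset = {quote(dataset)}",
    ]
    if limit is not None:
        # Only write [evaluation] when limit is set; omitting it leaves
        # the platform default (a random 10% sample) in effect.
        lines += ["", "[evaluation]", f"limit = {limit}"]
    return "\n".join(lines) + "\n"


# Smoke test: score only a few rows.
smoke = render_eval_config("my-rollout", "main.py", "openai/gpt-5-mini",
                           "my-dataset", limit=20)

# Formal evaluation: set limit to the dataset's row count (1000 is made up).
formal = render_eval_config("my-rollout", "main.py", "openai/gpt-5-mini",
                            "my-dataset", limit=1000)
print(smoke)
```

Write each result to its own file under configs/eval/ and submit it with osmosis eval submit.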

Status Lifecycle

An evaluation run moves through these statuses:
Status      Description
pending     Run is queued and waiting for resources to be provisioned.
running     Evaluation is actively executing against the dataset.
finished    Evaluation completed successfully. Score, pass rate, and sample counts are available.
failed      Evaluation encountered an error during execution. Check logs for details.
stopped     Evaluation was manually stopped by a user via the CLI or dashboard.
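
Scripts that react to these statuses can model them as an enum. The sketch below is illustrative only: EvalStatus, is_terminal, and can_stop are hypothetical names, and treating finished, failed, and stopped as final states is an assumption drawn from the descriptions above, not a documented guarantee.

```python
from enum import Enum


class EvalStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"
    STOPPED = "stopped"


# Assumed final states: a run in one of these is not expected to change again.
TERMINAL = frozenset({EvalStatus.FINISHED, EvalStatus.FAILED, EvalStatus.STOPPED})

# States in which a stop can still be requested.
STOPPABLE = frozenset({EvalStatus.PENDING, EvalStatus.RUNNING})


def is_terminal(status: str) -> bool:
    """Return True once a run has finished, failed, or been stopped."""
    return EvalStatus(status) in TERMINAL


def can_stop(status: str) -> bool:
    """Return True while a run is still pending or running."""
    return EvalStatus(status) in STOPPABLE


print(is_terminal("running"), can_stop("running"))  # prints: False True
```

An unknown status string raises ValueError, which surfaces unexpected values early instead of silently treating them as active.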

Monitoring

Track evaluation progress through the CLI or the platform dashboard.

CLI Commands

# List evaluation runs for the current workspace repository
osmosis eval list
osmosis eval list --all

# Show details and results for a single run
osmosis eval info my-eval-run
The info output includes the model, dataset, rollout, and timestamps, plus the aggregate score, pass rate, and total sample count once the run finishes. While a run is pending or running, results are a live snapshot.
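
To build intuition for how these numbers relate to pass_threshold, the following sketch computes them from a list of per-sample scores. The platform's exact formulas are not specified here. This assumes the aggregate score is the mean of the sample scores and that a sample passes when its score is at least pass_threshold. The function name summarize is hypothetical, and how multiple attempts per row (n > 1) are combined is not modeled.

```python
from statistics import mean


def summarize(scores: list[float], pass_threshold: float = 1.0) -> dict:
    """Summarize per-sample scores under the assumptions stated above."""
    if not scores:
        raise ValueError("no samples to summarize")
    # Assumption: a sample passes when its score meets or exceeds the threshold.
    passed = sum(1 for score in scores if score >= pass_threshold)
    return {
        "aggregate_score": mean(scores),      # assumed to be a simple mean
        "pass_rate": passed / len(scores),
        "total_samples": len(scores),
    }


print(summarize([1.0, 0.5, 1.0, 0.0]))
# prints: {'aggregate_score': 0.625, 'pass_rate': 0.5, 'total_samples': 4}
```

With the default threshold of 1.0, only perfect scores count as passes. Lowering pass_threshold to 0.5 in this example raises the pass rate to 0.75 while the aggregate score stays the same.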

Platform Dashboard

The web dashboard at platform.osmosis.ai lists evaluation runs alongside training runs. You can filter by status, dataset, model, and rollout, and inspect per-run scores and samples.

Managing Runs

Stopping a Run

osmosis eval stop my-eval-run
This requests a stop for a pending or running evaluation. The run moves to stopped once the platform finishes cleanup. Pass --yes to skip the confirmation prompt.

Next Steps

  • Config Files. Reference for the evaluation TOML config.
  • Datasets & Models. Upload datasets and inspect supported base models.
  • Training Runs. Submit a training run once your evaluation results look healthy.