An evaluation run runs your rollout against a platform dataset, using the same workspace, rollout, entrypoint, dataset, and optional commit_sha semantics as osmosis train submit. The platform clones the repository identified by the workspace directory’s origin remote and executes the rollout server-side, so push your changes and confirm Git Sync before submitting. Evaluation configs must live under configs/eval/ inside a structured Osmosis workspace directory.
osmosis eval submit is also the recommended pre-flight before a training run — run it first to catch problems before committing GPU time.

Quick Start

From inside your workspace directory:
osmosis dataset list                                # confirm the platform dataset name
git push                                            # make sure the platform sees your commit
osmosis eval submit configs/eval/my-rollout.toml
Then inspect or manage the run:
osmosis eval list
osmosis eval info <name>
osmosis eval stop <name>
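Because the platform only sees pushed commits, it can help to check the local checkout before submitting. The following is a minimal sketch, not part of the Osmosis CLI: the helper names (git, submit_blockers, check_checkout) are hypothetical, and it only uses standard git commands. It cannot confirm that Git Sync has finished on the platform side.

```python
import subprocess


def git(*args: str) -> str:
    """Run a git command in the current directory and return its trimmed stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def submit_blockers(porcelain: str, head: str, upstream: str) -> list[str]:
    """Return reasons the local checkout is not ready to submit (hypothetical helper)."""
    blockers = []
    if porcelain:
        # `git status --porcelain` prints one line per changed or untracked file.
        blockers.append("uncommitted changes in the working tree")
    if head != upstream:
        # HEAD may be ahead of or behind the remote-tracking branch.
        blockers.append("local HEAD differs from the upstream branch")
    return blockers


def check_checkout() -> list[str]:
    """Collect the git facts for the current directory and evaluate them."""
    return submit_blockers(
        porcelain=git("status", "--porcelain"),
        head=git("rev-parse", "HEAD"),
        upstream=git("rev-parse", "@{u}"),  # raises if no upstream is configured
    )
```

Calling check_checkout() from the workspace directory returns an empty list when the working tree is clean and HEAD matches the remote-tracking branch.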

Evaluation Config

See Config Files for the full field reference.
configs/eval/my-rollout.toml
[experiment]
rollout = "my-rollout"                # Rollout directory under rollouts/
entrypoint = "main.py"                # Entrypoint relative to the rollout directory
model_path = "openai/gpt-5-mini"      # LiteLLM-style model name for the evaluation policy
dataset = "my-platform-dataset"       # Platform dataset name from `osmosis dataset list`
# commit_sha =                        # Optional: pin to a specific commit

[evaluation]
# Optional. Omit values to use platform defaults.
# limit = 200
# n = 1
# batch_size = 1
# pass_threshold = 1.0
# agent_workflow_timeout_s = 450
# grader_timeout_s = 150

# [env]
# LOG_LEVEL = "INFO"

[secrets]
# Required for eval configs. Use required = [] only when no secrets are needed.
required = ["OPENAI_API_KEY"]
When [evaluation].limit is omitted, the platform evaluates a random 10% sample of the dataset (at least one row). Set limit to evaluate a fixed number of rows — the first N rows of the dataset, in order.
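The two selection modes can be illustrated with a short sketch. This is not the platform's implementation: select_rows is a hypothetical helper, and rounding the 10% sample down (with a floor of one row) is an assumption, since the exact rounding rule is not stated here.

```python
import random


def select_rows(rows, limit=None, seed=None):
    """Pick the rows to evaluate (illustrative only, not the platform's code)."""
    if not rows:
        return []
    if limit is not None:
        # With a limit: the first N rows, in dataset order.
        return rows[:limit]
    # Without a limit: a random 10% sample, at least one row.
    # Assumption: the sample size is rounded down.
    sample_size = max(1, len(rows) // 10)
    rng = random.Random(seed)
    return rng.sample(rows, sample_size)
```

In this sketch, a 50-row dataset yields 5 random rows when limit is omitted, while limit = 5 always yields rows 0 through 4. Setting limit is therefore the way to get a repeatable subset between runs.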
Git Sync is the source of truth for your rollout code. The CLI reads config values from the local TOML file you pass, but rollout code comes from the synced workspace repository. Commit, push, and wait for sync before submitting code changes; set commit_sha when you need a specific synced revision.

How It Works

1. Resolve workspace and config

The CLI reads the evaluation TOML, resolves the workspace from the Git origin remote, and validates the [experiment] and [secrets] sections (plus optional [evaluation] and [env]) locally before submitting.

2. Submit to the platform

The CLI submits the evaluation run request. The platform clones the connected workspace repository (or the pinned commit_sha) and prepares the evaluation environment.

3. Validate the model

Before evaluating any rows, the platform runs a pre-flight check that confirms [experiment].model_path is reachable with your configured credentials. If the model is unreachable (wrong name, missing or invalid API key, or provider rate limiting), the run fails early instead of consuming evaluation resources. Provide the model provider’s API key by registering it with osmosis secret set and listing it under [secrets].required (see Config Files).

4. Run the rollout server-side

The platform starts your rollout, drives AgentWorkflow.run(ctx) for each selected row of the platform dataset using [experiment].model_path as the evaluation policy, then runs Grader.grade(ctx) against the row’s ground_truth.

5. Aggregate results

The platform aggregates rewards, pass rates, and per-row results. Use osmosis eval info <name> (or osmosis --json eval info <name>) to inspect them.
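As a rough illustration of the aggregation step, the sketch below turns a list of per-row rewards into a mean reward and a pass rate. It is not the platform's implementation: summarize is a hypothetical helper, it assumes one reward per row, and it assumes a row passes when its reward is greater than or equal to pass_threshold.

```python
from statistics import mean


def summarize(rewards, pass_threshold):
    """Aggregate per-row rewards into summary metrics (illustrative only)."""
    if not rewards:
        raise ValueError("no rewards to summarize")
    # Assumption: a row passes when its reward meets or exceeds the threshold.
    passed = sum(1 for reward in rewards if reward >= pass_threshold)
    return {
        "rows": len(rewards),
        "mean_reward": mean(rewards),
        "pass_rate": passed / len(rewards),
    }
```

Under these assumptions, rewards of 1.0, 0.5, 1.0, and 0.0 with a threshold of 1.0 give a mean reward of 0.625 and a pass rate of 0.5. The two numbers answer different questions, which is why both are worth reading in the run details.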

Commands

| Command | Description |
| --- | --- |
| osmosis eval submit <config>.toml [--yes] | Submit an evaluation run from a TOML under configs/eval/. |
| osmosis eval list [--limit N] [--all] | List evaluation runs for the current workspace directory. |
| osmosis eval info <name-or-id> | Show details and results for a specific evaluation run. |
| osmosis eval stop <name-or-id> [--yes] | Stop a pending or running evaluation run. |
| osmosis eval rubric | Local LLM-as-judge over a JSONL conversation file. Does not touch the platform. |

See the Command Reference for the full flag list.

From Evaluation Run to Training Run

1. Submit an evaluation run

Run osmosis eval submit configs/eval/my-rollout.toml. Use osmosis eval list and osmosis eval info <name> to track progress and inspect results.

2. Iterate on rollout code

Push fixes to the workspace repository and re-submit. Set commit_sha to run the same evaluation against an older revision when comparing changes.

3. Submit a training run

Once the evaluation results look healthy, run osmosis train submit configs/training/my-rollout.toml. See Training Runs.

Local Rubric Scoring

osmosis eval rubric is a local utility for scoring an existing JSONL conversation file with an LLM judge. It does not require a workspace directory or platform authentication, and it does not run a rollout.
osmosis eval rubric -d conversations.jsonl \
  --rubric "Evaluate the assistant's helpfulness..." \
  --model openai/gpt-5-mini
See the Command Reference for the full flag list.

Next Steps

Config Files

Full reference for evaluation and training configuration files.

Git Sync

Push and sync rollout code before submitting evaluation runs or training runs.

Training Runs

Submit a training run once evaluation run results look good.