commit_sha semantics as osmosis train submit. The platform clones the repository identified by the workspace directory’s origin remote and executes the rollout server-side, so push your changes and confirm Git Sync before submitting.
Evaluation configs must live under configs/eval/ inside a structured Osmosis workspace directory.
osmosis eval submit is also the recommended pre-flight before a training run: run it first to catch problems before committing GPU time.
Quick Start
From inside your workspace directory, run osmosis eval submit configs/eval/my-rollout.toml.
Evaluation Config
See Config Files for the full field reference.
configs/eval/my-rollout.toml
When [evaluation].limit is omitted, the platform evaluates a random 10% sample of the dataset (at least one row). Set limit to evaluate a fixed number of rows: the first N rows of the dataset, in order.
How It Works
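The row selection governed by `[evaluation].limit` can be sketched as follows. This is an illustration of the documented behavior, not the platform's implementation: the `select_rows` helper and its `seed` parameter are hypothetical, and rounding the 10% sample down is an assumption.

```python
import random


def select_rows(rows, limit=None, seed=None):
    """Pick the rows an evaluation run would cover, per the rule described here."""
    rows = list(rows)
    if not rows:
        return []
    if limit is not None:
        # A fixed limit takes the first N rows, in dataset order.
        return rows[:limit]
    # No limit: a random 10% sample, but never fewer than one row.
    # Flooring the sample size is an assumption of this sketch.
    sample_size = max(1, len(rows) // 10)
    return random.Random(seed).sample(rows, sample_size)


dataset = list(range(100))
print(select_rows(dataset, limit=5))  # prints [0, 1, 2, 3, 4]
print(len(select_rows(dataset)))      # prints 10
```

Because the default sample is random, two runs without a limit may cover different rows; setting limit makes comparisons between runs repeatable.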
Resolve workspace and config
The CLI reads the evaluation TOML, resolves the workspace from the Git origin remote, and validates the [experiment] and [secrets] sections (plus optional [evaluation] and [env]) locally before submitting.
Submit to the platform
The CLI submits the evaluation run request. The platform clones the connected workspace repository (at the pinned commit_sha, if one is set) and prepares the evaluation environment.
Validate the model
Before evaluating any rows, the platform runs a pre-flight check that confirms [experiment].model_path is reachable with your configured credentials. If the model is unreachable (wrong name, missing or invalid API key, or provider rate limiting), the run fails early instead of consuming evaluation resources. Provide the model’s provider API key by registering it with osmosis secret set and listing it under [secrets].required (see Config Files).
Run the rollout server-side
The platform starts your rollout, drives AgentWorkflow.run(ctx) for each selected row of the platform dataset using [experiment].model_path as the evaluation policy, then runs Grader.grade(ctx) against the row’s ground_truth.
Commands
| Command | Description |
|---|---|
| osmosis eval submit <config>.toml [--yes] | Submit an evaluation run from a TOML under configs/eval/. |
| osmosis eval list [--limit N] [--all] | List evaluation runs for the current workspace directory. |
| osmosis eval info <name-or-id> | Show details and results for a specific evaluation run. |
| osmosis eval stop <name-or-id> [--yes] | Stop a pending or running evaluation run. |
| osmosis eval rubric | Local LLM-as-judge over a JSONL conversation file. Does not touch the platform. |
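For scripting, the commands in the table above can be assembled as argument lists. The sketch below only builds the argv and never invokes the CLI. The helper names are hypothetical, and the flags are taken from the table as written, on the assumption that `--limit` takes its value as a separate argument.

```python
from pathlib import PurePosixPath


def build_submit_command(config_path: str, yes: bool = False) -> list[str]:
    """Build argv for `osmosis eval submit`, checking the documented config location."""
    path = PurePosixPath(config_path)
    # Evaluation configs must live under configs/eval/ in the workspace directory.
    if path.suffix != ".toml" or path.parts[:2] != ("configs", "eval"):
        raise ValueError(f"expected a .toml file under configs/eval/, got {config_path!r}")
    argv = ["osmosis", "eval", "submit", str(path)]
    if yes:
        argv.append("--yes")
    return argv


def build_list_command(limit=None, show_all: bool = False) -> list[str]:
    """Build argv for `osmosis eval list` with the optional flags from the table."""
    argv = ["osmosis", "eval", "list"]
    if limit is not None:
        argv += ["--limit", str(limit)]
    if show_all:
        argv.append("--all")
    return argv


print(build_submit_command("configs/eval/my-rollout.toml", yes=True))
# prints ['osmosis', 'eval', 'submit', 'configs/eval/my-rollout.toml', '--yes']
```

Passing the result to `subprocess.run(argv, check=True)` from inside the workspace directory would run the command; keeping construction separate from execution makes the path check easy to test.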
From Evaluation Run to Training Run
Submit an evaluation run
Run osmosis eval submit configs/eval/my-rollout.toml. Use osmosis eval list and osmosis eval info <name-or-id> to track progress and inspect results.
Iterate on rollout code
Push fixes to the workspace repository and re-submit. Setting commit_sha lets you re-run the same evaluation against an older revision when comparing changes.
Submit a training run
Once the evaluation results look healthy, run osmosis train submit configs/training/my-rollout.toml. See Training Runs.
Local Rubric Scoring
osmosis eval rubric is a local utility for scoring an existing JSONL conversation file with an LLM judge. It does not require a workspace directory or platform authentication, and it does not run a rollout.
Next Steps
Config Files
Full reference for evaluation and training configuration files.
Git Sync
Push and sync rollout code before submitting evaluation runs or training runs.
Training Runs
Submit a training run once evaluation run results look good.