An evaluation run scores your rollout’s AgentWorkflow and Grader against a platform dataset and reports an aggregate score, pass rate, and per-sample results. The platform pulls your code from the synced workspace repository and runs the evaluation on its own infrastructure, so you don’t need GPUs or a training run.
Concepts
Smoke Test or Formal Evaluation
There are two reasons to run one:
- As a smoke test before training. Run an evaluation first to confirm the rollout works end-to-end and the grader returns reasonable scores on a small slice, before you commit GPUs to a full training run. Set a small [evaluation].limit to score only a few rows.
- As a formal evaluation. Measure agent quality on its own: to compare models or prompts, track quality over time, or run evaluations from CI. This works the same for a base model or a trained checkpoint. Set [evaluation].limit to the dataset’s row count to score every row; otherwise the platform scores a random 10% sample.
Evaluation Configuration vs Evaluation Run
An Evaluation Configuration is the recipe: it defines which model, dataset, AgentWorkflow, and evaluation settings to use. An Evaluation Run is a single execution of that configuration. You can submit multiple runs from the same configuration to compare models, prompts, or dataset slices.
Submitting an Evaluation Run
Submit an evaluation run using the CLI with a TOML configuration file under configs/eval/.
Pass --yes to skip the confirmation prompt in scripts or CI.
Key Configuration Fields
See Config Files for the full TOML reference with all available fields, including [env] and [secrets].
Status Lifecycle
An evaluation run moves through these statuses:
| Status | Description |
|---|---|
| pending | Run is queued and waiting for resources to be provisioned. |
| running | Evaluation is actively executing against the dataset. |
| finished | Evaluation completed successfully. Score, pass rate, and sample counts are available. |
| failed | Evaluation encountered an error during execution. Check logs for details. |
| stopped | Evaluation was manually stopped by a user via the CLI or dashboard. |
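For scripts that react to run status, the table above can be modeled as a small enum. This is a minimal sketch: the status strings come from the table, while the RunStatus class, the TERMINAL_STATUSES set, and the is_terminal helper are hypothetical names, and treating finished, failed, and stopped as terminal is an inference from their descriptions rather than a documented guarantee.

```python
from enum import Enum


class RunStatus(str, Enum):
    """Evaluation run statuses as listed in the table above."""

    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"
    STOPPED = "stopped"


# Assumption: these three statuses are final and a run does not leave them.
TERMINAL_STATUSES = frozenset(
    {RunStatus.FINISHED, RunStatus.FAILED, RunStatus.STOPPED}
)


def is_terminal(status: str) -> bool:
    """Return True when the run will make no further progress."""
    # RunStatus(status) raises ValueError for a string not in the table.
    return RunStatus(status) in TERMINAL_STATUSES


if __name__ == "__main__":
    for name in ("pending", "running", "finished", "failed", "stopped"):
        print(name, is_terminal(name))
```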
Monitoring
Track evaluation progress through the CLI or the platform dashboard.
CLI Commands
The info output includes the model, dataset, rollout, and timestamps, plus the aggregate score, pass rate, and total sample count once the run finishes. While a run is pending or running, results are a live snapshot.
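Because results are only a live snapshot while a run is pending or running, a CI job may want to wait for the run to leave those states before reading scores. The loop below is a sketch under assumptions: get_status is a hypothetical callable that you supply (for example, a wrapper that extracts the status from the CLI output), and the polling interval and timeout defaults are arbitrary, not platform recommendations.

```python
import time
from typing import Callable

# While a run is in one of these states, results are still a live snapshot.
ACTIVE_STATUSES = {"pending", "running"}


def wait_until_done(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until the run leaves pending/running, then return the status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in ACTIVE_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status lookup.
    fake = iter(["pending", "running", "finished"])
    print(wait_until_done(lambda: next(fake), poll_seconds=0.0))
```

Injecting the sleep function keeps the loop easy to test without real delays.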
Platform Dashboard
The web dashboard at platform.osmosis.ai lists evaluation runs alongside training runs. You can filter by status, dataset, model, and rollout, and inspect per-run scores and samples.
Managing Runs
Stopping a Run
The run’s status changes to stopped once the platform finishes cleanup. Pass --yes to skip the confirmation prompt.
Next Steps
Config Files
Reference for the evaluation TOML config.
Datasets & Models
Upload datasets and inspect supported base models.
Training Runs
Submit a training run once your evaluation results look healthy.