osmosis eval run executes your AgentWorkflow and Grader against a local dataset using your own LLM API key. It produces a summary report with rewards and pass rates — use it to measure agent behavior, compare models, and iterate on your workflow or grader without consuming training resources.
osmosis eval run also doubles as a smoke test before osmosis train submit. Pass --limit N (or point at a small dataset) to run only a handful of prompts and catch a broken workflow or grader locally — much cheaper than finding the bug after the platform has provisioned GPUs.
osmosis eval run configs/eval/default.toml
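
For example, to smoke-test just the first ten rows before committing to a full run (the value 10 is illustrative):

osmosis eval run configs/eval/default.toml --limit 10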

How It Works

1. Load dataset
The orchestrator loads your dataset from the path specified in the config file.

2. Run workflows
For each row in the dataset, it packages the input columns into an AgentWorkflowContext and runs AgentWorkflow.run(ctx) (see the sketch after these steps).

3. Grade outputs
After samples are collected, it creates a GraderContext with the samples and that row's reference answer (ground_truth, exposed as ctx.label), then runs Grader.grade(ctx).

4. Report results
The orchestrator collects all rewards and outputs a summary report with pass rates, average scores, and per-row results.
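
The following sketch mirrors that loop in plain Python. The names AgentWorkflow, AgentWorkflowContext, GraderContext, and ctx.label come from the steps above, but the stand-in class definitions, the prompt input column, and the exact-match grading logic are illustrative assumptions, not the real osmosis SDK.

import json
from dataclasses import dataclass

@dataclass
class AgentWorkflowContext:
    inputs: dict                      # stand-in: one dataset row's input columns

@dataclass
class GraderContext:
    samples: list                     # stand-in: collected workflow outputs
    label: str                        # the row's ground_truth, exposed as ctx.label

class AgentWorkflow:
    def run(self, ctx: AgentWorkflowContext) -> str:
        # Stand-in agent: a real workflow would call your model here.
        return f"echo: {ctx.inputs.get('prompt', '')}"   # 'prompt' is a hypothetical column

class Grader:
    def grade(self, ctx: GraderContext) -> float:
        # Stand-in grader: reward 1.0 if any sample matches the reference answer.
        return 1.0 if any(s == ctx.label for s in ctx.samples) else 0.0

def eval_loop(dataset_path: str, n: int = 1) -> None:
    workflow, grader, rewards = AgentWorkflow(), Grader(), []
    with open(dataset_path) as f:
        for line in f:                                   # step 1: load dataset rows
            row = json.loads(line)
            label = row.pop("ground_truth")
            wctx = AgentWorkflowContext(inputs=row)      # step 2: package input columns
            samples = [workflow.run(wctx) for _ in range(n)]
            gctx = GraderContext(samples=samples, label=label)
            rewards.append(grader.grade(gctx))           # step 3: grade collected samples
    print(f"rows: {len(rewards)}  avg reward: {sum(rewards) / len(rewards):.3f}")  # step 4: report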

Example Eval Config

See Config Files for the full field reference.
configs/eval/default.toml
[eval]
rollout = "my-rollout"
entrypoint = "main.py"
dataset = "data/test.jsonl"

[llm]
model = "openai/gpt-5.2"

Advanced Features

Eval results are cached automatically so re-runs skip completed rows. Use --fresh to discard all cached results and start from scratch, or --retry-failed to re-run only rows that previously failed.
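
Both cache flags, applied to the example config from earlier:

osmosis eval run configs/eval/default.toml --fresh          # discard all cached results
osmosis eval run configs/eval/default.toml --retry-failed   # re-run only previously failed rows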
Set [runs] n = 5 and pass_threshold = 0.8 in your config to run multiple samples per prompt and calculate pass@k metrics — the probability that at least one of k samples passes the threshold.
Set [runs] batch_size = 4 in your config (or use --batch-size 4 on the command line) to evaluate multiple prompts in parallel, speeding up large eval runs.
Add a [baseline] section with model = "openai/gpt-5-mini" to run a second model on the same dataset and compare reward scores side-by-side.
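
Combined in the config, the multi-sample, batching, and baseline options above might look like this (values are illustrative):

[runs]
n = 5
pass_threshold = 0.8
batch_size = 4

[baseline]
model = "openai/gpt-5-mini"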
Pass --log-samples to save full conversation histories to JSONL files alongside the results, useful for manual inspection and debugging.
The orchestrator fingerprints your dataset at the start of an eval run and warns if the file is modified while evaluation is in progress.

From Eval to Training

Once your eval results look healthy, commit your rollout code and submit a training run:
1. Commit and sync
Commit your rollouts/<name>/ directory and push to the default branch of your connected GitHub repo. The platform pulls the new code in through Git Sync.

2. Submit a training run
Run osmosis train submit configs/training/default.toml. The platform runs training against the synced code, not your local working tree, so anything you edit locally after pushing won't affect the run unless you commit, push, re-sync, and re-submit. Pin a specific revision with commit_sha when you want a known-good version while continuing to iterate locally.
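
A typical hand-off, using the rollout name from the example config above (the commit message and branch name are illustrative):

git add rollouts/my-rollout/
git commit -m "Tune grader thresholds"
git push origin main                                   # Git Sync pulls the new code
osmosis train submit configs/training/default.toml     # trains against the synced commit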

Next Steps

Config Files

Full reference for eval and training configuration files.

Training Runs

Submit a training run once your rollout passes local eval.