Local Evaluation

osmosis eval run executes your AgentWorkflow and Grader against a local dataset using your own LLM API key. It produces a summary report with rewards and pass rates — use it to measure agent behavior, compare models, and iterate on your workflow or grader without consuming training resources.

osmosis eval run also doubles as a smoke test before osmosis train submit. Pass --limit N (or point at a small dataset) to run only a handful of prompts and catch a broken workflow or grader locally — much cheaper than finding the bug after the platform has provisioned GPUs.

osmosis eval run configs/eval/default.toml

How It Works

Load dataset

The orchestrator loads your dataset from the path specified in the config file.

Run workflows

For each row in the dataset, the orchestrator packages the input columns into an AgentWorkflowContext and runs AgentWorkflow.run(ctx).

Grade outputs

After samples are collected, the orchestrator creates a GraderContext with the samples and that row’s reference answer (ground_truth, exposed as ctx.label), then runs Grader.grade(ctx).

Report results

The orchestrator collects all rewards and outputs a summary report with pass rates, average scores, and per-row results.
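
The four steps above can be pictured as a single loop. The sketch below is illustrative only: ToyWorkflowContext, ToyGraderContext, ToyWorkflow, ToyGrader, and evaluate are hypothetical stand-ins, not the Osmosis SDK classes, and the "question" column, the synchronous run method, and the >= pass check are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


# Hypothetical stand-ins: the real AgentWorkflowContext, GraderContext,
# AgentWorkflow and Grader come from the Osmosis SDK, whose constructors
# and fields are not shown on this page.
@dataclass
class ToyWorkflowContext:
    inputs: dict  # input columns for one dataset row


@dataclass
class ToyGraderContext:
    samples: list  # outputs collected from the workflow
    label: str     # the row's ground_truth


class ToyWorkflow:
    def run(self, ctx: ToyWorkflowContext) -> str:
        # A real workflow would call an LLM; this one just uppercases a column.
        return ctx.inputs["question"].upper()


class ToyGrader:
    def grade(self, ctx: ToyGraderContext) -> list[float]:
        # Reward 1.0 for an exact match with the label, else 0.0.
        return [1.0 if s == ctx.label else 0.0 for s in ctx.samples]


def evaluate(rows: list[dict], pass_threshold: float = 1.0) -> dict:
    workflow, grader = ToyWorkflow(), ToyGrader()
    per_row = []
    for row in rows:
        # Everything except ground_truth is treated as an input column.
        inputs = {k: v for k, v in row.items() if k != "ground_truth"}
        sample = workflow.run(ToyWorkflowContext(inputs))
        rewards = grader.grade(ToyGraderContext([sample], row["ground_truth"]))
        per_row.append(rewards[0])
    return {
        "per_row": per_row,
        "average": mean(per_row),
        # Assumption: a row passes when its reward is >= the threshold.
        "pass_rate": sum(r >= pass_threshold for r in per_row) / len(per_row),
    }


rows = [
    {"question": "abc", "ground_truth": "ABC"},
    {"question": "def", "ground_truth": "xyz"},
]
report = evaluate(rows)
```

Here report holds the per-row rewards together with their average and pass rate, mirroring the fields the summary report is described as containing.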

Example Eval Config

See Config Files for the full field reference.

configs/eval/default.toml

[eval]
rollout = "my-rollout"
entrypoint = "main.py"
dataset = "data/test.jsonl"

[llm]
model = "openai/gpt-5.2"

Advanced Features

Caching

Eval results are cached automatically so re-runs skip completed rows. Use --fresh to discard all cached results and start from scratch, or --retry-failed to re-run only rows that previously failed.

Pass@k

Set [runs] n = 5 and pass_threshold = 0.8 in your config to run multiple samples per prompt and calculate pass@k metrics — the probability that at least one of k samples passes the threshold.
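
To make the metric concrete, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where c of n samples passed. This page does not say which formula osmosis eval run uses internally, so treat the function below (a hypothetical pass_at_k helper) and the >= comparison against pass_threshold as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes), given c of n passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 - (ways to pick k samples that all fail) / (ways to pick any k samples)
    return 1.0 - comb(n - c, k) / comb(n, k)


rewards = [0.9, 0.4, 0.85, 0.1, 0.3]  # n = 5 samples for one prompt
pass_threshold = 0.8
# Assumption: a reward equal to the threshold counts as a pass.
passed = sum(r >= pass_threshold for r in rewards)

scores = {k: pass_at_k(len(rewards), passed, k) for k in (1, 3, 5)}
```

With two of five samples passing, this estimator gives 0.4 for pass@1, 0.9 for pass@3, and 1.0 for pass@5.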

Batch concurrency

Set [runs] batch_size = 4 in your config (or use --batch-size 4 on the command line) to evaluate multiple prompts in parallel, speeding up large eval runs.
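
As a rough picture of what a batch size means, the sketch below runs prompts in concurrent slices with asyncio. It is not the Osmosis scheduler: run_prompt and run_in_batches are hypothetical names, and whether the real orchestrator uses fixed slices or a sliding window of in-flight prompts is not stated on this page.

```python
import asyncio


async def run_prompt(prompt: str) -> str:
    # Placeholder for one workflow run; a real run would await an LLM call.
    await asyncio.sleep(0.01)
    return prompt[::-1]


async def run_in_batches(prompts: list[str], batch_size: int) -> list[str]:
    results = []
    # Walk the dataset in slices of batch_size, running each slice concurrently.
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        results.extend(await asyncio.gather(*(run_prompt(p) for p in batch)))
    return results


outputs = asyncio.run(run_in_batches(["ab", "cd", "ef", "gh", "ij"], batch_size=4))
```

asyncio.gather returns results in input order, so outputs lines up with the original prompt order even though the prompts within a slice run concurrently.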

Baseline comparison

Add a [baseline] section with model = "openai/gpt-5-mini" to run a second model on the same dataset and compare reward scores side-by-side.

Sample logging

Pass --log-samples to save full conversation histories to JSONL files alongside the results, useful for manual inspection and debugging.

Dataset integrity

The orchestrator fingerprints your dataset at the start of an eval run and warns if the file is modified while evaluation is in progress.
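
One common way to detect this kind of change is to hash the file before and after the run. The sketch below uses SHA-256 from the Python standard library; the actual fingerprinting method used by the orchestrator is not documented here, and the fingerprint helper and warning text are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "test.jsonl"
    dataset.write_text('{"question": "abc", "ground_truth": "ABC"}\n')
    before = fingerprint(dataset)

    # Simulate an edit made while an eval run is in progress.
    with dataset.open("a") as f:
        f.write('{"question": "def", "ground_truth": "DEF"}\n')
    after = fingerprint(dataset)

modified = before != after
if modified:
    print("warning: dataset changed during evaluation")
```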

From Eval to Training

Once your eval results look healthy, commit your rollout code and submit a training run:

Commit and sync

Commit your rollouts/<name>/ directory and push to the default branch of your connected GitHub repo. The platform pulls the new code in through Git Sync.

Submit a training run

Run osmosis train submit configs/training/default.toml. The platform runs training against the synced code — not your local working tree — so anything you edit locally after pushing won’t affect the run unless you commit, push, re-sync, and re-submit. Pin a specific revision with commit_sha when you want a known-good version while continuing to iterate locally.

Next Steps

Config Files

Training Runs
