osmosis eval run executes your AgentWorkflow and Grader against a local dataset using your own LLM API key. It produces a summary report with rewards and pass rates — use it to measure agent behavior, compare models, and iterate on your workflow or grader without consuming training resources.
How It Works
Run workflows
For each row in the dataset, it packages the input columns into an AgentWorkflowContext and runs AgentWorkflow.run(ctx).
Grade outputs
After samples are collected, it creates a GraderContext with the samples and that row’s reference answer (ground_truth, exposed as ctx.label), then runs Grader.grade(ctx).
Example Eval Config
See Config Files for the full field reference.
configs/eval/default.toml
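The config file itself isn't reproduced on this page, but a minimal sketch could look like the following. Only the [runs] and [baseline] fields below are named in the Advanced Features section; everything else lives in the Config Files reference.

```toml
# Illustrative sketch of configs/eval/default.toml, not the shipped file.
[runs]
n = 5                 # samples per prompt (enables pass@k)
pass_threshold = 0.8  # reward a sample needs to count as a pass
batch_size = 4        # prompts evaluated in parallel

[baseline]
model = "openai/gpt-5-mini"  # optional second model to compare against
```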
Advanced Features
Caching
Eval results are cached automatically so re-runs skip completed rows. Use --fresh to discard all cached results and start from scratch, or --retry-failed to re-run only rows that previously failed.
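The skip-completed-rows behavior can be pictured as keying results by row content. A minimal sketch, assuming a dict-backed cache and hash-based row keys; the orchestrator's actual storage and keying scheme are not documented here.

```python
import hashlib
import json


def row_key(row: dict) -> str:
    """Stable cache key for a dataset row (illustrative keying scheme)."""
    payload = json.dumps(row, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def run_eval(rows, evaluate, cache, fresh=False, retry_failed=False):
    """Skip rows with cached results, mimicking the --fresh and
    --retry-failed semantics described above."""
    if fresh:
        cache.clear()  # discard everything and start from scratch
    results = {}
    for row in rows:
        key = row_key(row)
        cached = cache.get(key)
        if cached is not None and not (retry_failed and cached.get("failed")):
            results[key] = cached  # completed row: skip re-evaluation
            continue
        results[key] = cache[key] = evaluate(row)
    return results
```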
Pass@k
Set [runs] n = 5 and pass_threshold = 0.8 in your config to run multiple samples per prompt and calculate pass@k metrics — the probability that at least one of k samples passes the threshold.
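A common way to compute this metric is the unbiased pass@k estimator over n samples with c passes; whether osmosis uses this exact formula is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c pass, is a pass.
    Unbiased estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n = 5 samples per prompt scored against pass_threshold = 0.8:
rewards = [0.9, 0.4, 0.85, 0.2, 0.7]
c = sum(r >= 0.8 for r in rewards)  # 2 passing samples
```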
Batch concurrency
Set [runs] batch_size = 4 in your config (or use --batch-size 4 on the command line) to evaluate multiple prompts in parallel, speeding up large eval runs.
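Conceptually, batch_size bounds how many prompts are in flight at once. A minimal sketch with asyncio; this is illustrative only, since the orchestrator's internals aren't shown in these docs.

```python
import asyncio


async def eval_all(prompts, run_one, batch_size=4):
    """Evaluate prompts with at most batch_size running concurrently."""
    sem = asyncio.Semaphore(batch_size)

    async def bounded(prompt):
        async with sem:  # blocks while batch_size tasks are in flight
            return await run_one(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))
```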
Baseline comparison
Add a [baseline] section with model = "openai/gpt-5-mini" to run a second model on the same dataset and compare reward scores side-by-side.
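The side-by-side comparison amounts to summarizing the same reward column under each model. A tiny sketch of that summary; the report's actual fields are an assumption.

```python
from statistics import mean


def compare(primary: list[float], baseline: list[float]) -> dict:
    """Mean reward per model plus the delta (illustrative summary only)."""
    p, b = mean(primary), mean(baseline)
    return {"primary": p, "baseline": b, "delta": p - b}
```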
Sample logging
Pass --log-samples to save full conversation histories to JSONL files alongside the results, useful for manual inspection and debugging.
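The JSONL convention is one JSON object per line, which makes the logs easy to inspect programmatically. A small reader sketch; the schema osmosis writes is not specified on this page.

```python
import json


def load_samples(path: str) -> list[dict]:
    """Read a JSONL file: one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```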
Dataset integrity
The orchestrator fingerprints your dataset at the start of an eval run and warns if the file is modified while evaluation is in progress.
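A file fingerprint of this kind is typically a content hash. A sketch assuming SHA-256; the actual hash the orchestrator uses is not documented here.

```python
import hashlib


def fingerprint(path: str) -> str:
    """SHA-256 of the file contents, read in chunks (hash choice assumed)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


# Taken at the start of the run, then re-checked to warn on mid-run edits:
# if fingerprint(dataset_path) != start_fp: warn("dataset changed")
```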
From Eval to Training
Once your eval results look healthy, commit your rollout code and submit a training run:
Commit and sync
Commit your rollouts/<name>/ directory and push to the default branch of your connected GitHub repo. The platform pulls the new code in through Git Sync.
Submit a training run
Run osmosis train submit configs/training/default.toml. The platform runs training against the synced code — not your local working tree — so anything you edit locally after pushing won’t affect the run unless you commit, push, re-sync, and re-submit. Pin a specific revision with commit_sha when you want a known-good version while continuing to iterate locally.
Next Steps
Config Files
Full reference for eval and training configuration files.
Training Runs
Submit a training run once your rollout passes local eval.