The Osmosis CLI uses TOML configuration files for two operations: running evaluations and submitting training runs. All config files live under the configs/ directory in your workspace.
In the examples below, commented-out fields (prefixed with #) are optional, and their defaults are shown inline. Required fields are uncommented.
Eval Config
Used by osmosis eval run to evaluate an agent against a dataset.
configs/eval/default.toml
```toml
[eval]
rollout = "my-rollout" # Rollout name (directory under rollouts/)
entrypoint = "main.py" # Entrypoint file (relative to rollout dir)
dataset = "data/test.jsonl" # Dataset path (relative to workspace root)
# limit = # Max rows to evaluate
# offset = 0 # Skip first N rows
# fresh = false # Discard cached results
# retry_failed = false # Re-run only failed rows

[llm]
model = "openai/gpt-5.2" # LiteLLM model name (required)
# base_url = # Custom OpenAI-compatible endpoint
# api_key_env = "OPENAI_API_KEY" # Env var name for API key

[runs]
# n = 1 # Runs per row (for pass@n)
# batch_size = 1 # Concurrent batch size
# pass_threshold = 1.0 # Score threshold for pass@k

[output]
# log_samples = false # Save conversations to JSONL
# output_path = # Structured output directory
# quiet = false # Suppress progress output
# debug = false # Enable debug logging + trace

# [baseline]
# model = "openai/gpt-5-mini" # Baseline model for comparison
# base_url = # Custom endpoint for baseline
# api_key_env = # Env var name for baseline API key
```
Field Reference
[eval] — Required
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| rollout | str | Yes | — | Name of the rollout directory under rollouts/ |
| entrypoint | str | Yes | — | Python entrypoint file relative to the rollout directory |
| dataset | str | Yes | — | Path to the dataset file, relative to the workspace root |
| limit | int | No | all rows | Maximum number of rows to evaluate (must be >= 1) |
| offset | int | No | 0 | Number of rows to skip from the beginning (must be >= 0) |
| fresh | bool | No | false | Discard all cached results and re-run from scratch |
| retry_failed | bool | No | false | Re-run only rows that previously failed |
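The two fields are typically composed so that offset skips rows before limit caps the count; under that reading, this sketch evaluates rows 100 through 149 of the dataset:

```toml
[eval]
rollout = "my-rollout"
entrypoint = "main.py"
dataset = "data/test.jsonl"
offset = 100 # skip rows 0-99
limit = 50   # then evaluate the next 50 rows (100-149)
```

Splitting a dataset this way lets you shard a large evaluation across several invocations.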
[llm] — Required
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| model | str | Yes | — | Model name in LiteLLM format (e.g. openai/gpt-5.2, anthropic/claude-sonnet-4-5) |
| base_url | str | No | — | Custom OpenAI-compatible API endpoint URL |
| api_key_env | str | No | — | Name of the environment variable containing the API key (e.g. OPENAI_API_KEY) |
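A common use of base_url and api_key_env is pointing the evaluator at a self-hosted OpenAI-compatible server. A sketch, where the model name, URL, and env var name are placeholders for your own deployment:

```toml
[llm]
model = "openai/my-served-model"      # placeholder LiteLLM provider/model name
base_url = "http://localhost:8000/v1" # placeholder OpenAI-compatible endpoint
api_key_env = "MY_SERVER_API_KEY"     # env var holding the key (not the key itself)
```

Note that api_key_env names the variable; the key itself never appears in the config file.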
[runs] — Optional
| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| n | int | 1 | >= 1 | Number of evaluation runs per dataset row. Use values > 1 for pass@n metrics. |
| batch_size | int | 1 | >= 1 | Number of concurrent evaluation requests |
| pass_threshold | float | 1.0 | 0.0 - 1.0 | Score threshold at or above which a sample counts as “passed” for pass@k calculation |
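For example, a pass@4 setup that treats any score of 0.8 or higher as a pass might look like this sketch:

```toml
[runs]
n = 4                # 4 runs per dataset row, enabling pass@4
batch_size = 8       # up to 8 concurrent requests
pass_threshold = 0.8 # samples scoring >= 0.8 count as passed
```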
[output] — Optional
| Field | Type | Default | Description |
|---|---|---|---|
| log_samples | bool | false | Save full conversation logs to JSONL files alongside results |
| output_path | str | — | Directory for structured output files. Overridden by the -o CLI flag. |
| quiet | bool | false | Suppress progress bars and intermediate output |
| debug | bool | false | Enable debug-level logging and execution tracing |
[baseline] — Optional
| Field | Type | Required | Description |
|---|---|---|---|
| model | str | Yes (if section present) | Baseline model in LiteLLM format for comparison scoring |
| base_url | str | No | Custom endpoint for the baseline model |
| api_key_env | str | No | Env var name for the baseline model’s API key |
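Enabling baseline comparison amounts to un-commenting the section with at least model set; a minimal sketch:

```toml
[baseline]
model = "openai/gpt-5-mini"    # required once the section is present
api_key_env = "OPENAI_API_KEY" # optional; env var for the baseline's key
```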
CLI flags like --fresh, --limit, --offset, --quiet, --debug, --output-path, --log-samples, and --batch-size override their config file counterparts, so you can keep a base config and adjust individual runs from the command line.
Training Config
Used by osmosis train submit to submit a training run.
rollout and entrypoint point into the synced repository on the platform, not your local workspace — the training run executes code fetched via Git Sync. By default the platform uses the latest synced commit on the default branch; set commit_sha to pin a run to a specific revision.
configs/training/default.toml
```toml
[experiment]
rollout = "my-rollout" # Rollout name (directory under rollouts/)
entrypoint = "main.py" # Entrypoint file name
model_path = "Qwen/Qwen3.5-35B-A3B" # Must be a supported model
dataset = "my-dataset" # Dataset name from `osmosis dataset list`
# commit_sha = # Pin to a specific commit

[training]
# lr = 1e-6 # Learning rate
# total_epochs = 1 # Number of training epochs
# n_samples_per_prompt = 8 # Rollout samples per prompt
# global_batch_size = 64 # Training batch size
# max_prompt_length = 8192 # Max prompt tokens
# max_response_length = 8192 # Max response tokens

[sampling]
# rollout_temperature = 1.0 # Sampling temperature during rollouts
# rollout_top_p = 1.0 # Top-p sampling

[checkpoints]
# eval_interval = # Evaluate every N rollouts
# checkpoint_save_freq = 20 # Save checkpoint every N rollouts
```
Field Reference
[experiment] — Required
| Field | Type | Required | Description |
|---|---|---|---|
| rollout | str | Yes | Name of the rollout directory under rollouts/ |
| entrypoint | str | Yes | Python entrypoint file name (e.g. main.py) |
| model_path | str | Yes | HuggingFace model path. Must be a model supported by the platform. |
| dataset | str | Yes | Dataset name as shown in osmosis dataset list |
| commit_sha | str | No | Pin training to a specific Git commit SHA from the synced repository. When omitted, the platform uses the latest synced commit on the default branch. Useful for reproducibility and for submitting a known-good revision while iterating locally. |
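For reproducible submissions, pinning looks like the sketch below. The SHA shown is a placeholder; substitute a real full-length SHA from your synced repository:

```toml
[experiment]
rollout = "my-rollout"
entrypoint = "main.py"
model_path = "Qwen/Qwen3.5-35B-A3B"
dataset = "my-dataset"
commit_sha = "0000000000000000000000000000000000000000" # placeholder full SHA
```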
[training] — Optional
| Field | Type | Default | Description |
|---|---|---|---|
| lr | float | platform default | Learning rate for the optimizer |
| total_epochs | int | platform default | Number of passes through the dataset |
| n_samples_per_prompt | int | platform default | Number of rollout samples generated per prompt. Must be a positive integer. |
| global_batch_size | int | platform default | Batch size for RL training updates. Must be divisible by n_samples_per_prompt. |
| max_prompt_length | int | platform default | Maximum number of tokens in the prompt |
| max_response_length | int | platform default | Maximum number of tokens in the model response |
global_batch_size must be evenly divisible by n_samples_per_prompt. The CLI will reject configs that violate this constraint.
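Concretely, with the values from the commented example above:

```toml
[training]
n_samples_per_prompt = 8 # 8 rollout samples per prompt
global_batch_size = 64   # 64 / 8 = 8 prompts per update: valid
# global_batch_size = 60 # 60 / 8 = 7.5: rejected by the CLI
```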
[sampling] — Optional
| Field | Type | Default | Description |
|---|---|---|---|
| rollout_temperature | float | platform default | Temperature for sampling during rollout generation. Higher values increase diversity. |
| rollout_top_p | float | platform default | Top-p (nucleus) sampling threshold during rollouts |
[checkpoints] — Optional
| Field | Type | Default | Description |
|---|---|---|---|
| eval_interval | int | — | Run evaluation every N rollout steps |
| checkpoint_save_freq | int | platform default | Save a LoRA checkpoint every N rollout steps |
Start with only the required [experiment] fields and let the platform use its defaults for training hyperparameters. Tune values incrementally based on your training run metrics.
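Following that advice, a minimal submission config contains only the required section:

```toml
[experiment]
rollout = "my-rollout"
entrypoint = "main.py"
model_path = "Qwen/Qwen3.5-35B-A3B"
dataset = "my-dataset"
# [training], [sampling], and [checkpoints] all fall back to platform defaults
```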