
The Osmosis CLI uses TOML configuration files for two operations: running evaluations and submitting training runs. All config files live under the `configs/` directory in your workspace.
In the examples below, commented-out fields (prefixed with `#`) are optional; their defaults are shown inline. Required fields are left uncommented.

Eval Config

Used by `osmosis eval run` to evaluate an agent against a dataset.
configs/eval/default.toml
```toml
[eval]
rollout = "my-rollout"                         # Rollout name (directory under rollouts/)
entrypoint = "main.py"                         # Entrypoint file (relative to rollout dir)
dataset = "data/test.jsonl"                    # Dataset path (relative to workspace root)
# limit =                                      # Max rows to evaluate
# offset = 0                                   # Skip first N rows
# fresh = false                                # Discard cached results
# retry_failed = false                         # Re-run only failed rows

[llm]
model = "openai/gpt-5.2"                       # LiteLLM model name (required)
# base_url =                                   # Custom OpenAI-compatible endpoint
# api_key_env = "OPENAI_API_KEY"               # Env var name for API key

[runs]
# n = 1                                        # Runs per row (for pass@n)
# batch_size = 1                               # Concurrent batch size
# pass_threshold = 1.0                         # Score threshold for pass@k

[output]
# log_samples = false                          # Save conversations to JSONL
# output_path =                                # Structured output directory
# quiet = false                                # Suppress progress output
# debug = false                                # Enable debug logging + trace

# [baseline]
# model = "openai/gpt-5-mini"                  # Baseline model for comparison
# base_url =                                   # Custom endpoint for baseline
# api_key_env =                                # Env var name for baseline API key
```

Field Reference

[eval] — Required

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `rollout` | str | Yes |  | Name of the rollout directory under `rollouts/` |
| `entrypoint` | str | Yes |  | Python entrypoint file relative to the rollout directory |
| `dataset` | str | Yes |  | Path to the dataset file, relative to the workspace root |
| `limit` | int | No | all rows | Maximum number of rows to evaluate (must be >= 1) |
| `offset` | int | No | 0 | Number of rows to skip from the beginning (must be >= 0) |
| `fresh` | bool | No | false | Discard all cached results and re-run from scratch |
| `retry_failed` | bool | No | false | Re-run only rows that previously failed |
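
As a concrete sketch, a quick smoke-test config might cap the run at a handful of rows and bypass the cache. The values below are illustrative, not recommendations:

```toml
[eval]
rollout = "my-rollout"
entrypoint = "main.py"
dataset = "data/test.jsonl"
limit = 10       # evaluate only the first 10 rows
fresh = true     # ignore cached results from earlier runs

[llm]
model = "openai/gpt-5.2"
```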

[llm] — Required

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `model` | str | Yes |  | Model name in LiteLLM format (e.g. `openai/gpt-5.2`, `anthropic/claude-sonnet-4-5`) |
| `base_url` | str | No |  | Custom OpenAI-compatible API endpoint URL |
| `api_key_env` | str | No |  | Name of the environment variable containing the API key (e.g. `OPENAI_API_KEY`) |
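
For a self-hosted OpenAI-compatible server, the three fields combine as sketched below. The model name, URL, and env var are placeholders, not values from this documentation; check your server's own docs:

```toml
[llm]
model = "openai/my-local-model"        # placeholder model name
base_url = "http://localhost:8000/v1"  # placeholder endpoint URL
api_key_env = "MY_API_KEY"             # env var holding the key, if your server needs one
```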

[runs] — Optional

| Field | Type | Default | Constraints | Description |
| --- | --- | --- | --- | --- |
| `n` | int | 1 | >= 1 | Number of evaluation runs per dataset row. Use values > 1 for pass@n metrics. |
| `batch_size` | int | 1 | >= 1 | Number of concurrent evaluation requests |
| `pass_threshold` | float | 1.0 | 0.0 - 1.0 | Score threshold at or above which a sample counts as “passed” for pass@k calculation |
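
For example, a pass@4 setup with a relaxed pass threshold could look like this (the threshold and batch size are illustrative):

```toml
[runs]
n = 4                  # four runs per row, enabling pass@4
batch_size = 8         # up to eight concurrent requests
pass_threshold = 0.8   # scores >= 0.8 count as passed
```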

[output] — Optional

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `log_samples` | bool | false | Save full conversation logs to JSONL files alongside results |
| `output_path` | str |  | Directory for structured output files. Overridden by the `-o` CLI flag. |
| `quiet` | bool | false | Suppress progress bars and intermediate output |
| `debug` | bool | false | Enable debug-level logging and execution tracing |
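
A debugging-oriented output setup might combine these fields as sketched below (the output path is a placeholder):

```toml
[output]
log_samples = true               # write full conversations to JSONL
output_path = "results/run-01"   # placeholder directory for structured output
debug = true                     # debug-level logging and tracing
```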

[baseline] — Optional

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `model` | str | Yes (if section present) | Baseline model in LiteLLM format for comparison scoring |
| `base_url` | str | No | Custom endpoint for the baseline model |
| `api_key_env` | str | No | Env var name for the baseline model’s API key |

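Since only `model` is required once the section is present, a minimal baseline sketch is:

```toml
[baseline]
model = "openai/gpt-5-mini"   # model the primary results are compared against
```
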
CLI flags like `--fresh`, `--limit`, `--offset`, `--quiet`, `--debug`, `--output-path`, `--log-samples`, and `--batch-size` override their config file counterparts, so you can keep a base config and adjust individual runs from the command line.

Training Config

Used by `osmosis train submit` to launch a training run.
`rollout` and `entrypoint` point into the synced repository on the platform, not your local workspace — the training run executes code fetched via Git Sync. By default the platform uses the latest synced commit on the default branch; set `commit_sha` to pin a run to a specific revision.
configs/training/default.toml
```toml
[experiment]
rollout = "my-rollout"                         # Rollout name (directory under rollouts/)
entrypoint = "main.py"                         # Entrypoint file name
model_path = "Qwen/Qwen3.5-35B-A3B"            # Must be a supported model
dataset = "my-dataset"                         # Dataset name from `osmosis dataset list`
# commit_sha =                                 # Pin to a specific commit

[training]
# lr = 1e-6                                    # Learning rate
# total_epochs = 1                             # Number of training epochs
# n_samples_per_prompt = 8                     # Rollout samples per prompt
# global_batch_size = 64                       # Training batch size
# max_prompt_length = 8192                     # Max prompt tokens
# max_response_length = 8192                   # Max response tokens

[sampling]
# rollout_temperature = 1.0                    # Sampling temperature during rollouts
# rollout_top_p = 1.0                          # Top-p sampling

[checkpoints]
# eval_interval =                              # Evaluate every N rollouts
# checkpoint_save_freq = 20                    # Save checkpoint every N rollouts
```

Field Reference

[experiment] — Required

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `rollout` | str | Yes | Name of the rollout directory under `rollouts/` |
| `entrypoint` | str | Yes | Python entrypoint file name (e.g. `main.py`) |
| `model_path` | str | Yes | HuggingFace model path. Must be a model supported by the platform. |
| `dataset` | str | Yes | Dataset name as shown in `osmosis dataset list` |
| `commit_sha` | str | No | Pin training to a specific Git commit SHA from the synced repository. When omitted, the platform uses the latest synced commit on the default branch. Useful for reproducibility and for submitting a known-good revision while iterating locally. |
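
For instance, pinning a run to a known-good revision looks like the sketch below (the SHA is a placeholder; use a real commit from your synced repository):

```toml
[experiment]
rollout = "my-rollout"
entrypoint = "main.py"
model_path = "Qwen/Qwen3.5-35B-A3B"
dataset = "my-dataset"
commit_sha = "abc1234"   # placeholder; substitute a real SHA from the synced repo
```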

[training] — Optional

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `lr` | float | platform default | Learning rate for the optimizer |
| `total_epochs` | int | platform default | Number of passes through the dataset |
| `n_samples_per_prompt` | int | platform default | Number of rollout samples generated per prompt. Must be a positive integer. |
| `global_batch_size` | int | platform default | Batch size for RL training updates. Must be divisible by `n_samples_per_prompt`. |
| `max_prompt_length` | int | platform default | Maximum number of tokens in the prompt |
| `max_response_length` | int | platform default | Maximum number of tokens in the model response |

`global_batch_size` must be evenly divisible by `n_samples_per_prompt`. The CLI will reject configs that violate this constraint.
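
For example, with 8 samples per prompt, valid batch sizes are multiples of 8 (values here are illustrative):

```toml
[training]
n_samples_per_prompt = 8
global_batch_size = 64    # OK: 64 is evenly divisible by 8
# global_batch_size = 60  # rejected by the CLI: 60 % 8 != 0
```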

[sampling] — Optional

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `rollout_temperature` | float | platform default | Temperature for sampling during rollout generation. Higher values increase diversity. |
| `rollout_top_p` | float | platform default | Top-p (nucleus) sampling threshold during rollouts |
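
As an illustration, dialing down rollout diversity might look like this (the values are placeholders, not recommendations; defaults come from the platform):

```toml
[sampling]
rollout_temperature = 0.7   # lower temperature for less diverse samples
rollout_top_p = 0.95        # nucleus sampling cutoff
```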

[checkpoints] — Optional

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `eval_interval` | int |  | Run evaluation every N rollout steps |
| `checkpoint_save_freq` | int | platform default | Save a LoRA checkpoint every N rollout steps |

Start with only the required `[experiment]` fields and let the platform use its defaults for training hyperparameters. Tune values incrementally based on your training run metrics.
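
Following that advice, the smallest valid training config contains only the `[experiment]` section, mirroring the example values used above:

```toml
[experiment]
rollout = "my-rollout"
entrypoint = "main.py"
model_path = "Qwen/Qwen3.5-35B-A3B"
dataset = "my-dataset"
```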