The Osmosis CLI uses TOML configuration files for two operations: running evaluations and submitting training runs. All config files live under the configs/ directory in your workspace.
In the examples below, commented-out fields (prefixed with #) are optional, and their defaults are shown inline. Required fields are uncommented.
Eval Config
Used by osmosis eval run to evaluate an agent against a dataset.
configs/eval/default.toml
```toml
[eval]
rollout = "my-rollout" # Rollout name (directory under rollouts/)
entrypoint = "main.py" # Entrypoint file (relative to rollout dir)
dataset = "data/test.jsonl" # Dataset path (relative to workspace root)
# limit = # Max rows to evaluate
# offset = 0 # Skip first N rows
# fresh = false # Discard cached results
# retry_failed = false # Re-run only failed rows

[llm]
model = "openai/gpt-5.2" # LiteLLM model name (required)
# base_url = # Custom OpenAI-compatible endpoint
# api_key_env = "OPENAI_API_KEY" # Env var name for API key

[runs]
# n = 1 # Runs per row (for pass@n)
# batch_size = 1 # Concurrent batch size
# pass_threshold = 1.0 # Score threshold for pass@k

[output]
# log_samples = false # Save conversations to JSONL
# output_path = # Structured output directory
# quiet = false # Suppress progress output
# debug = false # Enable debug logging + trace

# [baseline]
# model = "openai/gpt-5-mini" # Baseline model for comparison
# base_url = # Custom endpoint for baseline
# api_key_env = # Env var name for baseline API key
```
Field Reference
[eval] — Required
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| rollout | str | Yes | — | Name of the rollout directory under rollouts/ |
| entrypoint | str | Yes | — | Python entrypoint file relative to the rollout directory |
| dataset | str | Yes | — | Path to the dataset file, relative to the workspace root |
| limit | int | No | all rows | Maximum number of rows to evaluate (must be >= 1) |
| offset | int | No | 0 | Number of rows to skip from the beginning (must be >= 0) |
| fresh | bool | No | false | Discard all cached results and re-run from scratch |
| retry_failed | bool | No | false | Re-run only rows that previously failed |
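The two fields are typically composed so that offset skips rows before limit caps the count; under that reading, this sketch evaluates rows 100 through 149 of the dataset:

```toml
[eval]
rollout = "my-rollout"
entrypoint = "main.py"
dataset = "data/test.jsonl"
offset = 100 # skip rows 0-99
limit = 50   # then evaluate the next 50 rows (100-149)
```

Splitting a dataset this way lets you shard a large evaluation across several invocations.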
[llm] — Required
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| model | str | Yes | — | Model name in LiteLLM format (e.g. openai/gpt-5.2, anthropic/claude-sonnet-4-5) |
| base_url | str | No | — | Custom OpenAI-compatible API endpoint URL |
| api_key_env | str | No | — | Name of the environment variable containing the API key (e.g. OPENAI_API_KEY) |
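A common use of base_url and api_key_env is pointing the evaluator at a self-hosted OpenAI-compatible server. A sketch, where the model name, URL, and env var name are placeholders for your own deployment:

```toml
[llm]
model = "openai/my-served-model"      # placeholder LiteLLM provider/model name
base_url = "http://localhost:8000/v1" # placeholder OpenAI-compatible endpoint
api_key_env = "MY_SERVER_API_KEY"     # env var holding the key (not the key itself)
```

Note that api_key_env names the variable; the key itself never appears in the config file.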
[runs] — Optional
| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| n | int | 1 | >= 1 | Number of evaluation runs per dataset row. Use values > 1 for pass@n metrics. |
| batch_size | int | 1 | >= 1 | Number of concurrent evaluation requests |
| pass_threshold | float | 1.0 | 0.0 - 1.0 | Score threshold at or above which a sample counts as “passed” for pass@k calculation |
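For example, a pass@4 setup that treats any score of 0.8 or higher as a pass might look like this sketch:

```toml
[runs]
n = 4                # 4 runs per dataset row, enabling pass@4
batch_size = 8       # up to 8 concurrent requests
pass_threshold = 0.8 # samples scoring >= 0.8 count as passed
```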
[output] — Optional
| Field | Type | Default | Description |
|---|---|---|---|
| log_samples | bool | false | Save full conversation logs to JSONL files alongside results |
| output_path | str | — | Directory for structured output files. Overridden by the -o CLI flag. |
| quiet | bool | false | Suppress progress bars and intermediate output |
| debug | bool | false | Enable debug-level logging and execution tracing |
[baseline] — Optional
| Field | Type | Required | Description |
|---|---|---|---|
| model | str | Yes (if section present) | Baseline model in LiteLLM format for comparison scoring |
| base_url | str | No | Custom endpoint for the baseline model |
| api_key_env | str | No | Env var name for the baseline model’s API key |
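Enabling baseline comparison amounts to un-commenting the section with at least model set; a minimal sketch:

```toml
[baseline]
model = "openai/gpt-5-mini"    # required once the section is present
api_key_env = "OPENAI_API_KEY" # optional; env var for the baseline's key
```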
CLI flags like --fresh, --limit, --offset, --quiet, --debug, --output-path, --log-samples, and --batch-size override their config file counterparts, so you can keep a base config and adjust individual runs from the command line.
Training Config
Used by osmosis train submit to submit a training run.
rollout and entrypoint point into the synced repository on the platform, not your local workspace — the training run executes code fetched via Git Sync. By default the platform uses the latest synced commit on the default branch; set commit_sha to pin a run to a specific revision.
configs/training/default.toml
```toml
[experiment]
rollout = "my-rollout" # Rollout name (directory under rollouts/)
entrypoint = "main.py" # Entrypoint file name
model_path = "Qwen/Qwen3.5-35B-A3B" # Must be a supported model
dataset = "my-dataset" # Dataset name from `osmosis dataset list`
# commit_sha = # Pin to a specific commit

[training]
# lr = 1e-6 # Learning rate
# total_epochs = 1 # Number of training epochs
# n_samples_per_prompt = 8 # Rollout samples per prompt
# global_batch_size = 64 # Training batch size
# max_prompt_length = 8192 # Max prompt tokens
# max_response_length = 8192 # Max response tokens

[sampling]
# rollout_temperature = 1.0 # Sampling temperature during rollouts
# rollout_top_p = 1.0 # Top-p sampling

[checkpoints]
# eval_interval = # Evaluate every N rollouts
# checkpoint_save_freq = 20 # Save checkpoint every N rollouts
```
Field Reference
[experiment] — Required
| Field | Type | Required | Description |
|---|---|---|---|
| rollout | str | Yes | Name of the rollout directory under rollouts/ |
| entrypoint | str | Yes | Python entrypoint file name (e.g. main.py) |
| model_path | str | Yes | HuggingFace model path. Must be a model supported by the platform. |
| dataset | str | Yes | Dataset name as shown in osmosis dataset list |
| commit_sha | str | No | Pin training to a specific Git commit SHA from the synced repository. When omitted, the platform uses the latest synced commit on the default branch. Useful for reproducibility and for submitting a known-good revision while iterating locally. |
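For reproducible submissions, pinning looks like the sketch below. The SHA shown is a placeholder; substitute a real full-length SHA from your synced repository:

```toml
[experiment]
rollout = "my-rollout"
entrypoint = "main.py"
model_path = "Qwen/Qwen3.5-35B-A3B"
dataset = "my-dataset"
commit_sha = "0000000000000000000000000000000000000000" # placeholder full SHA
```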
[training] — Optional
| Field | Type | Default | Description |
|---|---|---|---|
| lr | float | platform default | Learning rate for the optimizer |
| total_epochs | int | platform default | Number of passes through the dataset |
| n_samples_per_prompt | int | platform default | Number of rollout samples generated per prompt. Must be a positive integer. |
| global_batch_size | int | platform default | Batch size for RL training updates. Must be divisible by n_samples_per_prompt. |
| max_prompt_length | int | platform default | Maximum number of tokens in the prompt |
| max_response_length | int | platform default | Maximum number of tokens in the model response |
global_batch_size must be evenly divisible by n_samples_per_prompt. The CLI will reject configs that violate this constraint.
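Concretely, with the values from the commented example above:

```toml
[training]
n_samples_per_prompt = 8 # 8 rollout samples per prompt
global_batch_size = 64   # 64 / 8 = 8 prompts per update: valid
# global_batch_size = 60 # 60 / 8 = 7.5: rejected by the CLI
```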
[sampling] — Optional
| Field | Type | Default | Description |
|---|---|---|---|
| rollout_temperature | float | platform default | Temperature for sampling during rollout generation. Higher values increase diversity. |
| rollout_top_p | float | platform default | Top-p (nucleus) sampling threshold during rollouts |
[checkpoints] — Optional
| Field | Type | Default | Description |
|---|---|---|---|
| eval_interval | int | — | Run evaluation every N rollout steps |
| checkpoint_save_freq | int | platform default | Save a LoRA checkpoint every N rollout steps |
Start with only the required [experiment] fields and let the platform use its defaults for training hyperparameters. Tune values incrementally based on your training run metrics.
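Following that advice, a minimal submission config contains only the required section:

```toml
[experiment]
rollout = "my-rollout"
entrypoint = "main.py"
model_path = "Qwen/Qwen3.5-35B-A3B"
dataset = "my-dataset"
# [training], [sampling], and [checkpoints] all fall back to platform defaults
```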