Concepts
Training Configuration vs Training Run
A Training Configuration is the recipe: it defines which model, dataset, AgentWorkflow, and hyperparameters to use. A Training Run is a single execution of that configuration. You can submit multiple runs from the same configuration to experiment with different settings.
Training Behavior
Each submitted run is a single managed training job for the rollout, dataset, model, and hyperparameters in its TOML config. To run another experiment, submit the config again with updated fields such as total_epochs, sampling settings, or checkpoint cadence.
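As a rough illustration of re-submitting a configuration with updated fields, the sketch below derives an experiment variant from a base configuration held as a plain Python dictionary. The `training` table name and the `with_overrides` helper are hypothetical and exist only for this example. Only `total_epochs` and `checkpoint_save_freq` are field names taken from this page, and the actual TOML layout is defined in the Config Files reference.

```python
import copy

# Hypothetical in-memory view of a TOML config. The "training" table name
# is an assumption, not the platform's documented schema.
base_config = {
    "training": {
        "total_epochs": 1,
        "checkpoint_save_freq": 10,
    },
}


def with_overrides(config, table, **fields):
    """Return a deep copy of config with the given fields updated in one table."""
    variant = copy.deepcopy(config)
    variant.setdefault(table, {}).update(fields)
    return variant


# Each variant would be written to its own TOML file and submitted as a separate run.
longer_run = with_overrides(base_config, "training", total_epochs=3)
```

Copying the base configuration instead of editing it in place keeps the original recipe intact, so every run can be traced back to the exact fields that changed.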
Submitting a Training Run
Submit a training run using the CLI with a TOML configuration file.
Key Configuration Fields
See Config Files for the full TOML reference with all available fields.
Status Lifecycle
Every training run progresses through a series of statuses:
| Status | Description |
|---|---|
| pending | Run is queued and waiting for GPU resources to be provisioned. |
| running | Training is actively in progress. Metrics and checkpoints are being produced. |
| finished | Training completed successfully. Final checkpoint and metrics are available. |
| failed | Training encountered an error during execution. Check logs for details. |
| stopped | Training was manually stopped by a user via the CLI or dashboard. |
| killed | Training was terminated during platform cleanup or stop handling. |
| crashed | Training process terminated unexpectedly (e.g. OOM, hardware failure). |
| unknown | The platform could not determine the current training state. |
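The statuses in the table can be modeled in client code roughly as follows. This is a minimal sketch, not part of any Osmosis SDK: the `RunStatus` enum and both helper functions are hypothetical names, and treating only pending and running as active is an assumption drawn from the descriptions above.

```python
from enum import Enum


class RunStatus(str, Enum):
    """Status strings exactly as listed in the table."""
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"
    STOPPED = "stopped"
    KILLED = "killed"
    CRASHED = "crashed"
    UNKNOWN = "unknown"


# Assumption: a run is still in flight while it is queued or training.
ACTIVE = {RunStatus.PENDING, RunStatus.RUNNING}

# Statuses this page explicitly associates with available checkpoints:
# finished (final checkpoint) and failed/crashed (possibly earlier ones).
CHECKPOINTS_MENTIONED = {RunStatus.FINISHED, RunStatus.FAILED, RunStatus.CRASHED}


def is_active(status: str) -> bool:
    """True while the run is queued or training. Raises ValueError on unknown strings."""
    return RunStatus(status) in ACTIVE


def worth_checking_checkpoints(status: str) -> bool:
    """True for statuses where this page mentions checkpoints.

    Other statuses are not covered here, so the helper returns False
    for them rather than guessing.
    """
    return RunStatus(status) in CHECKPOINTS_MENTIONED
```

A polling loop could, for example, keep waiting while `is_active(status)` is true and then look for checkpoints when `worth_checking_checkpoints(status)` holds.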
A run in failed or crashed status may still have usable checkpoints saved before the failure occurred.
Monitoring
Track training progress through the CLI or the platform dashboard.
CLI Commands
train info reports progress (current_step / total_steps) and the most recent reward. train list surfaces the same fields so you can scan runs at a glance.
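The progress ratio can be turned into a readable percentage with a small helper. This sketch assumes `current_step` and `total_steps` have already been read from the CLI output as integers. The function names are hypothetical, and how the values are parsed is left out because the output format is not described here.

```python
def progress_fraction(current_step: int, total_steps: int) -> float:
    """Return the completed fraction in [0, 1]; 0.0 when total_steps is not yet known."""
    if total_steps <= 0:
        return 0.0
    # Clamp so an out-of-range step count never reports below 0% or above 100%.
    return min(max(current_step / total_steps, 0.0), 1.0)


def format_progress(current_step: int, total_steps: int) -> str:
    """Format progress as 'current/total (NN%)'."""
    pct = progress_fraction(current_step, total_steps) * 100
    return f"{current_step}/{total_steps} ({pct:.0f}%)"
```

For instance, `format_progress(50, 200)` yields `50/200 (25%)`.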
Platform Dashboard
The web dashboard at platform.osmosis.ai provides:
- Run list: search and filter runs by status, dataset, base model, and rollout.
- Overview metrics: view Duration, Reward, Improvement, Samples, Training Reward, Validation Reward, Model Entropy, Response Length, Total Length, and Truncation Ratio when available.
- Checkpoints: view saved checkpoints with their step, reward, deployment status, and Hugging Face upload status.
- Outputs: inspect output artifacts when they are available.
LoRA Checkpoints
During training, LoRA checkpoints are saved at the interval specified by checkpoint_save_freq in your configuration. Each checkpoint captures the adapter weights at a specific training step.
You can:
- Compare checkpoints by their reward scores to find the best-performing step
- Export checkpoints from the dashboard
- Upload checkpoints to Hugging Face
- Deploy LoRA models for inference with osmosis model deploy
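Comparing checkpoints by reward amounts to picking the record with the highest score. The sketch below assumes checkpoint metadata has been gathered into a list of dictionaries with `step` and `reward` keys. That record shape, the sample values, and the `best_checkpoint` helper are all hypothetical and serve only as illustration.

```python
from typing import Optional

# Hypothetical checkpoint records; the values are invented for illustration.
checkpoints = [
    {"step": 100, "reward": 0.42},
    {"step": 200, "reward": 0.57},
    {"step": 300, "reward": None},  # reward not available for this step
    {"step": 400, "reward": 0.51},
]


def best_checkpoint(records: list[dict]) -> Optional[dict]:
    """Return the record with the highest reward, or None if none are scored."""
    scored = [r for r in records if r.get("reward") is not None]
    if not scored:
        return None
    # On equal reward, prefer the earlier step (less training for the same score).
    return max(scored, key=lambda r: (r["reward"], -r["step"]))


best = best_checkpoint(checkpoints)
```

With the sample data, `best` is the step 200 record. Note that the highest-reward checkpoint is not necessarily the last one saved, which is why comparing across steps is useful before exporting or deploying.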
Managing Runs
Stopping a Run
A run that is manually stopped through the CLI or dashboard moves to the stopped status.
Next Steps
Datasets
Upload datasets for training.
Models
Manage base models and deploy trained LoRA models.