Creating a Training Run
From your project dashboard:
- Click New Training Run
- Configure the run (see sections below)
- Click Start Training
Configuration
Base Model
Select the foundation model to fine-tune. The platform supports popular open-weight models compatible with RL training.
Dataset
Upload or select a training dataset. Datasets contain prompts (and optionally ground truth) that the model trains on during rollouts.
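As a rough illustration only, a prompt dataset with optional ground truth could be prepared as JSONL. The file name and field names below are assumptions, not the platform's required schema:

```python
import json

# Hypothetical example records: "prompt" drives the rollout,
# "ground_truth" is optional and only needed if a reward function uses it.
records = [
    {"prompt": "Convert 3 km to miles.", "ground_truth": "1.864"},
    {"prompt": "Summarize the plot of Hamlet in one sentence."},  # no ground truth
]

# Write one JSON object per line (JSONL).
with open("train_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```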
Reward Functions
Select one or more reward functions to score the model’s outputs during training:
- Synced reward functions — Imported from your connected GitHub repository via Git Sync (see the sketch after this list)
- Reward rubrics — LLM-evaluated rubrics that use configured providers
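As a minimal sketch of a deterministic, code-based reward function (the function name and signature here are hypothetical; the actual interface for synced reward functions is defined by your repository integration), scoring could be as simple as an exact-match check against ground truth:

```python
def exact_match_reward(response: str, ground_truth: str) -> float:
    """Hypothetical reward function: 1.0 if the model's answer matches
    the ground truth after normalization, else 0.0."""
    return 1.0 if response.strip().lower() == ground_truth.strip().lower() else 0.0

# Example: scoring a single rollout output.
print(exact_match_reward("  1.864 ", "1.864"))  # 1.0
```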
Tools
Select the MCP tools available to the agent during training rollouts. Tools are synced from your connected repository or defined in your Remote Rollout server.
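For context, an MCP tool is typically defined with the MCP Python SDK. The sketch below is a generic example of a tool server and is not specific to this platform's sync mechanism; the server and tool names are placeholders:

```python
from mcp.server.fastmcp import FastMCP

# A minimal MCP server exposing one tool the agent could call during rollouts.
mcp = FastMCP("calculator")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

if __name__ == "__main__":
    mcp.run()
```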
Hyperparameters
Key training hyperparameters you can configure:
| Parameter | Description |
|---|---|
| Learning rate | Step size for model updates |
| Batch size | Number of samples per training step |
| Max turns | Maximum agent turns per rollout episode |
| KL penalty | Coefficient for KL divergence penalty (prevents catastrophic forgetting) |
| Epochs | Number of passes through the dataset |
| Temperature | Sampling temperature during rollouts |
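For intuition on the KL penalty, the reward passed to the optimizer is typically reduced in proportion to how far the fine-tuned policy drifts from the reference (base) model. The sketch below illustrates the standard technique in general terms; it is not the platform's exact implementation, and the default coefficient is an assumption:

```python
def kl_penalized_reward(reward: float,
                        policy_logprob: float,
                        reference_logprob: float,
                        kl_coef: float = 0.1) -> float:
    """Reward with a KL penalty applied.

    policy_logprob / reference_logprob are log-probabilities of the sampled
    tokens under the fine-tuned policy and the frozen base model. The penalty
    grows as the policy drifts from the reference, which discourages
    catastrophic forgetting.
    """
    kl_estimate = policy_logprob - reference_logprob  # simple per-sample KL estimate
    return reward - kl_coef * kl_estimate

# Example: a reward of 0.9 with the policy slightly off-reference.
print(kl_penalized_reward(0.9, policy_logprob=-1.2, reference_logprob=-1.5))  # 0.87
```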
Reward Rubrics
Reward rubrics (formerly LLM Judges) use external LLM providers to evaluate model outputs during training. Instead of writing deterministic scoring logic, you describe evaluation criteria in natural language and an LLM scores the output.
How Reward Rubrics Work
- During a training rollout, the model generates a response
- The response is sent to an LLM provider (OpenAI, Anthropic, Google, etc.) along with your rubric
- The LLM evaluates the response against your criteria and returns a score
- The score is used as the reward signal for RL training
Reward rubrics require API keys for the providers you want to use. Configure these in Workspace Settings → LLM Provider Keys.
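Conceptually, the evaluation call looks like the sketch below. This is a generic illustration using the OpenAI Python SDK, not the platform's internal implementation; the rubric text, model name, and score parsing are all assumptions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the assistant's response from 0.0 to 1.0 for factual accuracy "
    "and completeness. Reply with the numeric score only."
)

def rubric_score(prompt: str, response: str, model: str = "gpt-4o-mini") -> float:
    """Ask an LLM judge to score a response against the rubric."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Prompt:\n{prompt}\n\nResponse:\n{response}"},
        ],
    )
    # The returned score becomes the reward signal for RL training.
    return float(completion.choices[0].message.content.strip())
```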
Writing Reward Rubrics
Learn how to write and test rubrics with the @osmosis_rubric decorator.
Training Strategies
Standard Training
Single pass through your dataset with RL optimization. Best for:
- Initial training experiments
- Well-defined tasks with clear reward signals
- Smaller datasets
Continuous Training
Multiple epochs with ongoing monitoring. Best for:
- Production model improvement
- Large datasets where multiple passes help
- Tasks requiring gradual refinement
Managing Training Runs
Starting and Stopping
- Start — Provisions GPUs and begins training
- Pause — Saves current state and releases resources
- Resume — Continues from the last checkpoint
- Stop — Ends training and finalizes checkpoints
Checkpoints
During training, the platform automatically saves checkpoints at configurable intervals. From the training run page:
- View all saved checkpoints with their training step and metrics
- Compare checkpoints by reward scores
- Merge a checkpoint to create a deployable model
- Export merged models to Hugging Face Hub
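Once a merged checkpoint has been exported, it can be loaded like any other Hugging Face model. A minimal sketch using the transformers library; the repository name below is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-org/your-merged-model" is a placeholder for your exported repository.
model_id = "your-org/your-merged-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Quick smoke test of the exported model.
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```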