A training run is a single RL training session that improves a base model using your reward functions, tools, and datasets. The platform handles GPU provisioning, the training loop, and checkpoint management.

Creating a Training Run

From your project dashboard:
  1. Click New Training Run
  2. Configure the run (see sections below)
  3. Click Start Training

Configuration

Base Model

Select the foundation model to fine-tune. The platform supports popular open-weight models compatible with RL training.

Dataset

Upload or select a training dataset. Datasets contain prompts (and optionally ground truth) that the model trains on during rollouts.
Datasets are typically in Parquet format with columns for prompts and expected outputs. See the Remote Rollout quickstart for dataset format examples.
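
As a rough illustration, a minimal dataset could be written to Parquet with pandas as shown below. The column names prompt and ground_truth are placeholders, so match them to the schema shown in the Remote Rollout quickstart.

```python
# Illustrative only: build a small training dataset and save it as Parquet.
# The column names ("prompt", "ground_truth") are assumptions here; check the
# Remote Rollout quickstart for the schema your rollout server expects.
import pandas as pd

rows = [
    {"prompt": "Convert 3 km to miles.", "ground_truth": "1.864"},
    {"prompt": "What is the capital of France?", "ground_truth": "Paris"},
]

df = pd.DataFrame(rows)
df.to_parquet("train.parquet", index=False)  # requires pyarrow or fastparquet
print(df.head())
```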

Reward Functions

Select one or more reward functions to score the model’s outputs during training:
  • Synced reward functions — Imported from your connected GitHub repository via Git Sync
  • Reward rubrics — LLM-evaluated rubrics that use configured providers
Multiple reward functions can be combined with configurable weights.
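
To make the weighting concrete, the sketch below shows two simple deterministic reward functions and a weighted combination. The function names and the combine helper are purely illustrative and are not the platform's reward function API; the platform applies the weights you configure in the run settings.

```python
# Hypothetical sketch of combining multiple reward scores with weights.
# These functions and the combine() helper are illustrative only, not the
# interface used by synced reward functions.
def exact_match_reward(response: str, ground_truth: str) -> float:
    """1.0 if the response matches the expected answer, else 0.0."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

def length_penalty_reward(response: str, max_chars: int = 500) -> float:
    """Penalize overly long responses on a 0..1 scale."""
    return max(0.0, 1.0 - len(response) / max_chars)

def combine(scores_and_weights: list[tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs."""
    total_weight = sum(w for _, w in scores_and_weights)
    return sum(s * w for s, w in scores_and_weights) / total_weight

reward = combine([
    (exact_match_reward("Paris", "Paris"), 0.8),
    (length_penalty_reward("Paris"), 0.2),
])
print(reward)  # weighted average of the two scores, on a 0..1 scale
```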

Tools

Select the MCP tools available to the agent during training rollouts. Tools are synced from your connected repository or defined in your Remote Rollout server.
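
For reference, a tool might be defined with the official MCP Python SDK roughly as follows; the exact layout your Remote Rollout server or synced repository expects may differ.

```python
# A minimal MCP tool server, assuming the official MCP Python SDK
# (pip install "mcp[cli]"). How the platform discovers tools from your
# repository may differ; this only shows what a tool definition looks like.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("calculator")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers and return the sum."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```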

Hyperparameters

Key training hyperparameters you can configure:
| Parameter | Description |
| --- | --- |
| Learning rate | Step size for model updates |
| Batch size | Number of samples per training step |
| Max turns | Maximum agent turns per rollout episode |
| KL penalty | Coefficient for the KL divergence penalty (prevents catastrophic forgetting) |
| Epochs | Number of passes through the dataset |
| Temperature | Sampling temperature during rollouts |
Default values are provided and work well for most use cases. Adjust based on your training results.
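
As a rough baseline only (these are not the platform defaults), a configuration might look like the sketch below; treat the values as a starting point to adjust after reviewing your reward curves.

```python
# Illustrative starting values only; these are not the platform defaults.
hyperparameters = {
    "learning_rate": 1e-6,   # step size for model updates
    "batch_size": 32,        # samples per training step
    "max_turns": 8,          # agent turns per rollout episode
    "kl_penalty": 0.01,      # KL divergence coefficient
    "epochs": 1,             # passes through the dataset
    "temperature": 1.0,      # sampling temperature during rollouts
}
```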

Reward Rubrics

Reward rubrics (formerly LLM Judges) use external LLM providers to evaluate model outputs during training. Instead of writing deterministic scoring logic, you describe evaluation criteria in natural language and an LLM scores the output.

How Reward Rubrics Work

  1. During a training rollout, the model generates a response
  2. The response is sent to an LLM provider (OpenAI, Anthropic, Google, etc.) along with your rubric
  3. The LLM evaluates the response against your criteria and returns a score
  4. The score is used as the reward signal for RL training
Reward rubrics require API keys for the providers you want to use. Configure these in Workspace Settings → LLM Provider Keys.
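
Conceptually, steps 2–4 amount to the loop sketched below, shown with the OpenAI Python SDK as an example provider. The platform runs this loop for you; the rubric text, model name, and score parsing here are illustrative assumptions.

```python
# Conceptual sketch of how an LLM-scored rubric produces a reward signal.
# The rubric, model name, and score parsing are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = "Score 0-10: Is the answer factually correct and concise?"

def judge(response_text: str) -> float:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a strict grader. Reply with a single number."},
            {"role": "user", "content": f"Rubric: {RUBRIC}\n\nResponse to grade:\n{response_text}"},
        ],
    )
    score = float(completion.choices[0].message.content.strip())
    return score / 10.0  # normalize to 0..1 for use as a reward signal

print(judge("The capital of France is Paris."))
```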

Writing Reward Rubrics

Learn how to write and test rubrics with the @osmosis_rubric decorator.

Training Strategies

Standard Training

Single pass through your dataset with RL optimization. Best for:
  • Initial training experiments
  • Well-defined tasks with clear reward signals
  • Smaller datasets

Continuous Training

Multiple epochs with ongoing monitoring. Best for:
  • Production model improvement
  • Large datasets where multiple passes help
  • Tasks requiring gradual refinement

Managing Training Runs

Starting and Stopping

  • Start — Provisions GPUs and begins training
  • Pause — Saves current state and releases resources
  • Resume — Continues from the last checkpoint
  • Stop — Ends training and finalizes checkpoints

Checkpoints

During training, the platform automatically saves checkpoints at configurable intervals. From the training run page:
  • View all saved checkpoints with their training step and metrics
  • Compare checkpoints by reward scores
  • Merge a checkpoint to create a deployable model
  • Export merged models to Hugging Face Hub
See Monitoring for details on tracking training progress and managing checkpoints.
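
Once a merged model has been exported, it can be pulled from the Hugging Face Hub like any other checkpoint; the repository id below is a placeholder.

```python
# Load an exported merged model from the Hugging Face Hub.
# "your-org/your-merged-model" is a placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/your-merged-model"  # replace with your exported repo

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```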

Next Steps