A training run is a single RL training session that improves a base model using your reward functions, tools, and datasets. The platform handles GPU provisioning, the training loop, and checkpoint management.

Creating a Training Run

From your project dashboard:
  1. Click New Training Run
  2. Configure the run (see sections below)
  3. Click Start Training

Configuration

Base Model

Select the foundation model to fine-tune. The platform supports popular open-weight models compatible with RL training.

Dataset

Upload or select a training dataset. Datasets contain prompts (and optionally ground truth) that the model trains on during rollouts.
Datasets are typically in Parquet format with columns for prompts and expected outputs. See the Remote Rollout quickstart for dataset format examples.
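
As a rough illustration, a minimal dataset could be written to Parquet with pandas as shown below. The column names prompt and ground_truth are placeholders, so match them to the schema shown in the Remote Rollout quickstart.

```python
# Illustrative only: build a small training dataset and save it as Parquet.
# The column names ("prompt", "ground_truth") are assumptions here; check the
# Remote Rollout quickstart for the schema your rollout server expects.
import pandas as pd

rows = [
    {"prompt": "Convert 3 km to miles.", "ground_truth": "1.864"},
    {"prompt": "What is the capital of France?", "ground_truth": "Paris"},
]

df = pd.DataFrame(rows)
df.to_parquet("train.parquet", index=False)  # requires pyarrow or fastparquet
print(df.head())
```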

Reward Functions

Select one or more reward functions to score the model’s outputs during training:
  • Synced reward functions — Imported from your connected GitHub repository via Git Sync
  • Reward rubrics — LLM-evaluated rubrics that use configured providers
Multiple reward functions can be combined with configurable weights.
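
To make the weighting concrete, the sketch below shows two simple deterministic reward functions and a weighted combination. The function names and the combine helper are purely illustrative and are not the platform's reward function API; the platform applies the weights you configure in the run settings.

```python
# Hypothetical sketch of combining multiple reward scores with weights.
# These functions and the combine() helper are illustrative only, not the
# interface used by synced reward functions.
def exact_match_reward(response: str, ground_truth: str) -> float:
    """1.0 if the response matches the expected answer, else 0.0."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

def length_penalty_reward(response: str, max_chars: int = 500) -> float:
    """Penalize overly long responses on a 0..1 scale."""
    return max(0.0, 1.0 - len(response) / max_chars)

def combine(scores_and_weights: list[tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs."""
    total_weight = sum(w for _, w in scores_and_weights)
    return sum(s * w for s, w in scores_and_weights) / total_weight

reward = combine([
    (exact_match_reward("Paris", "Paris"), 0.8),
    (length_penalty_reward("Paris"), 0.2),
])
print(reward)  # weighted average of the two scores, on a 0..1 scale
```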

Tools

Select the MCP tools available to the agent during training rollouts. Tools are synced from your connected repository or defined in your Remote Rollout server.
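
For reference, a tool might be defined with the official MCP Python SDK roughly as follows; the exact layout your Remote Rollout server or synced repository expects may differ.

```python
# A minimal MCP tool server, assuming the official MCP Python SDK
# (pip install "mcp[cli]"). How the platform discovers tools from your
# repository may differ; this only shows what a tool definition looks like.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("calculator")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers and return the sum."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```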

Hyperparameters

Key training hyperparameters you can configure:
| Parameter | Description |
| --- | --- |
| Learning rate | Step size for model updates |
| Batch size | Number of samples per training step |
| Max turns | Maximum agent turns per rollout episode |
| KL penalty | Coefficient for the KL divergence penalty (prevents catastrophic forgetting) |
| Epochs | Number of passes through the dataset |
| Temperature | Sampling temperature during rollouts |
Default values are provided and work well for most use cases. Adjust based on your training results.
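
As a rough baseline only (these are not the platform defaults), a configuration might look like the sketch below; treat the values as a starting point to adjust after reviewing your reward curves.

```python
# Illustrative starting values only; these are not the platform defaults.
hyperparameters = {
    "learning_rate": 1e-6,   # step size for model updates
    "batch_size": 32,        # samples per training step
    "max_turns": 8,          # agent turns per rollout episode
    "kl_penalty": 0.01,      # KL divergence coefficient
    "epochs": 1,             # passes through the dataset
    "temperature": 1.0,      # sampling temperature during rollouts
}
```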

Reward Rubrics

Reward rubrics (formerly LLM Judges) use external LLM providers to evaluate model outputs during training. Instead of writing deterministic scoring logic, you describe evaluation criteria in natural language and an LLM scores the output.

How Reward Rubrics Work

  1. During a training rollout, the model generates a response
  2. The response is sent to an LLM provider (OpenAI, Anthropic, Google, etc.) along with your rubric
  3. The LLM evaluates the response against your criteria and returns a score
  4. The score is used as the reward signal for RL training
Reward rubrics require API keys for the providers you want to use. Configure these in Workspace Settings → LLM Provider Keys.
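
Conceptually, steps 2–4 amount to the loop sketched below, shown with the OpenAI Python SDK as an example provider. The platform runs this loop for you; the rubric text, model name, and score parsing here are illustrative assumptions.

```python
# Conceptual sketch of how an LLM-scored rubric produces a reward signal.
# The rubric, model name, and score parsing are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = "Score 0-10: Is the answer factually correct and concise?"

def judge(response_text: str) -> float:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a strict grader. Reply with a single number."},
            {"role": "user", "content": f"Rubric: {RUBRIC}\n\nResponse to grade:\n{response_text}"},
        ],
    )
    score = float(completion.choices[0].message.content.strip())
    return score / 10.0  # normalize to 0..1 for use as a reward signal

print(judge("The capital of France is Paris."))
```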

Writing Reward Rubrics

Learn how to write and test rubrics with the @osmosis_rubric decorator.

Training Strategies

Standard Training

Single pass through your dataset with RL optimization. Best for:
  • Initial training experiments
  • Well-defined tasks with clear reward signals
  • Smaller datasets

Continuous Training

Multiple epochs with ongoing monitoring. Best for:
  • Production model improvement
  • Large datasets where multiple passes help
  • Tasks requiring gradual refinement

Managing Training Runs

Starting and Stopping

  • Start — Provisions GPUs and begins training
  • Pause — Saves current state and releases resources
  • Resume — Continues from the last checkpoint
  • Stop — Ends training and finalizes checkpoints

Checkpoints

During training, the platform automatically saves checkpoints at configurable intervals. From the training run page:
  • View all saved checkpoints with their training step and metrics
  • Compare checkpoints by reward scores
  • Merge a checkpoint to create a deployable model
  • Export merged models to Hugging Face Hub
See Monitoring for details on tracking training progress and managing checkpoints.
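
Once a merged model has been exported, it can be pulled from the Hugging Face Hub like any other checkpoint; the repository id below is a placeholder.

```python
# Load an exported merged model from the Hugging Face Hub.
# "your-org/your-merged-model" is a placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/your-merged-model"  # replace with your exported repo

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```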

Next Steps