Welcome to Osmosis — the forward-deployed reinforcement learning platform. Osmosis helps companies create task-specific models that beat foundation models at a fraction of the cost.

Why Osmosis

Hands-on Deployments

We work directly with customers to support the entire post-training workflow — from feature engineering to reward function creation.

Reinforcement Fine-Tuning

A comprehensive post-training platform that lets engineers apply cutting-edge reinforcement learning techniques such as GRPO and DAPO.

Continuous Improvement

Integrate with your evaluation solutions to monitor performance and automatically start re-training runs — without the need for an engineer in the loop.

Use Cases

Further reading: Open Source SLM Trained for MCP — see how Osmosis trained a small language model for tool use with reinforcement learning. Visit osmosis.ai for more use cases and demos.

How It Works

1. You Define

Provide the building blocks for training:
  • Tools & Agent Logic — the actions your agent can take
  • Reward Functions — how outputs are scored (see the sketch after this list)
  • Training Data — the tasks your model learns from
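As a concrete example, a reward function is just code that scores one attempt. A minimal sketch in Python; the signature and names are illustrative assumptions, not the Osmosis API:

```python
# Illustrative only: the signature below is an assumption, not the Osmosis API.
def reward(final_answer: str, expected: str) -> float:
    """Score one rollout: full credit for an exact match, partial otherwise."""
    if final_answer.strip() == expected.strip():
        return 1.0
    # Partial credit when the expected value appears somewhere in the answer.
    if expected.strip() in final_answer:
        return 0.5
    return 0.0
```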
2. Osmosis Trains

The platform handles the heavy lifting:
  • GPU Training Cluster — managed infrastructure, no setup needed
  • RL Training Loop — GRPO, DAPO, and multi-turn tool training (see the sketch after this list)
  • Checkpoints & Metrics — track progress in real time
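To give a rough sense of what the training loop does, GRPO scores a group of rollouts for the same task and normalizes each reward against the group's mean and spread, so updates push the model toward above-average attempts. A minimal illustration of that advantage computation, not Osmosis internals:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each rollout's reward
    against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # no learning signal if all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Example: four rollouts of the same task, scored by the reward function.
print(group_relative_advantages([1.0, 0.0, 0.5, 0.0]))
```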
3. Deploy Your Model

Ship a model that’s better at your tasks:
  • Merge to HuggingFace — export trained weights (see the loading example after this list)
  • Deploy Anywhere — use your model in any environment
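Once merged and exported, the trained weights load like any other Hugging Face checkpoint. A sketch using the transformers library; the repo name is a placeholder for your own exported model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-org/your-tuned-model" is a placeholder for your exported repo.
model = AutoModelForCausalLM.from_pretrained("your-org/your-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("your-org/your-tuned-model")

inputs = tokenizer("List the open invoices for ACME Corp.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```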

Get Started

What is a Rollout?

In reinforcement learning, a rollout is the process of running a policy in an environment to generate a trajectory — the complete sequence of actions, observations, and outcomes from start to finish. In the LLM context, a rollout is a single attempt by the model to solve a task, including any reasoning steps, tool usage, and final output. Think of it like a single ChatGPT conversation: if multiple users ask the same question to the same model, each interaction counts as a separate rollout.

Each rollout produces a trajectory that captures everything the model did during that attempt. A reward function then scores how well the model performed. Osmosis collects these trajectories and rewards, then uses reinforcement learning (GRPO, DAPO) to update the model’s policy — nudging it toward strategies that earn higher rewards.

By running thousands of rollouts per training iteration, the model discovers which reasoning patterns, tool-use strategies, and response styles lead to better outcomes — and improves measurably on your specific tasks over time.
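To make the loop concrete, here is a stripped-down sketch of rollout collection: run the policy on a task several times, score each trajectory, and hand the (trajectory, reward) pairs to the trainer. All names are illustrative; run_policy stands in for the model plus its tools:

```python
def collect_rollouts(task, run_policy, reward_fn, n: int = 8):
    """Run the policy n times on one task and score each trajectory.

    run_policy(task) -> trajectory: the full record of one attempt
    (reasoning steps, tool calls, final output). Names are illustrative.
    """
    rollouts = []
    for _ in range(n):
        trajectory = run_policy(task)  # one attempt = one rollout
        score = reward_fn(trajectory)  # how well did this attempt do?
        rollouts.append((trajectory, score))
    return rollouts  # the trainer turns these into policy updates (GRPO, DAPO)
```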

Choose Your Workflow

Osmosis supports two main workflows for connecting your code to the training platform:
Local Rollout
  • Best for: Reward functions, rubrics, MCP tools
  • How it works: Push to GitHub → auto-synced to platform
  • Setup: Add decorators + folder structure
  • When to use: Standard tool-use training

Remote Rollout
  • Best for: Custom agent loops with complex logic
  • How it works: Run your own HTTP server → platform connects
  • Setup: Implement RolloutAgentLoop
  • When to use: Multi-step reasoning, custom environments
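For the remote workflow, your agent loop runs behind an HTTP endpoint that the platform calls with tasks. A minimal sketch using Python's standard library; the route, payload shape, and handler names are assumptions for illustration, not the actual RolloutAgentLoop interface:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class RolloutHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumed payload: {"task": ...}; the real schema comes from the Osmosis SDK.
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        trajectory = self.run_agent(body["task"])  # your custom agent loop
        response = json.dumps({"trajectory": trajectory}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)

    def run_agent(self, task):
        # Placeholder for multi-step reasoning, tool calls, custom environments.
        return [{"role": "assistant", "content": f"echo: {task}"}]

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), RolloutHandler).serve_forever()
```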