
Osmosis AI Python SDK

The Osmosis AI Python SDK provides tools for evaluating LLM outputs using natural language rubrics and reward functions. It supports both local deterministic evaluation and remote LLM-based semantic evaluation across multiple providers.

What is Osmosis AI?

Osmosis AI helps you:
  • Evaluate LLM outputs with natural language rubrics
  • Create reward functions for reinforcement learning and scoring
  • Compare providers across OpenAI, Anthropic, Gemini, xAI, and more
  • Batch process evaluations with a built-in CLI tool

Core Concepts

Two Evaluation Approaches

1. Local Reward Functions - Fast, deterministic scoring

from osmosis_ai import osmosis_reward

@osmosis_reward
def exact_match(solution_str: str, ground_truth: str, extra_info: dict | None = None) -> float:
    # Return 1.0 when the output matches the reference exactly, ignoring surrounding whitespace.
    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0
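
A minimal usage sketch for the function above; the decorated function is called directly, as in the rubric example below, and the inputs here are illustrative:

# Illustrative call to the exact_match reward function defined above.
score = exact_match(
    solution_str="Paris ",
    ground_truth="Paris",
)
print(score)  # 1.0, since both values are stripped before comparison
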
2. Remote Rubric Evaluation - LLM-powered semantic judgment

from osmosis_ai import osmosis_rubric, evaluate_rubric

@osmosis_rubric
def helpfulness_check(solution_str: str, ground_truth: str | None, extra_info: dict) -> float:
    return evaluate_rubric(
        rubric="Evaluate how helpful and clear the response is.",
        solution_str=solution_str,
        model_info={"provider": "openai", "model": "gpt-5"}
    )

score = helpfulness_check(
    solution_str="You can reset your password by clicking 'Forgot Password'.",
    ground_truth=None,
    extra_info={}
)

CLI Tool

Batch process datasets with progress tracking:
osmosis eval --rubric <rubric_id> --data <path_to_data>

Rubrics are defined in rubric_configs.yaml so they can be reused across evaluations.
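
If you prefer to stay in Python, a batch run can also be sketched directly with the evaluate_rubric helper shown earlier; the dataset file name, JSONL shape, and "response" field below are assumptions for illustration, not the CLI's actual data format:

import json

from osmosis_ai import evaluate_rubric

# Hypothetical JSONL dataset with one {"response": ...} object per line.
with open("responses.jsonl") as f:
    rows = [json.loads(line) for line in f]

scores = [
    evaluate_rubric(
        rubric="Evaluate how helpful and clear the response is.",
        solution_str=row["response"],
        model_info={"provider": "openai", "model": "gpt-5"},
    )
    for row in rows
]
print(f"mean score: {sum(scores) / len(scores):.2f}")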

Key Features

  • LLM-Based Rubrics - Natural language evaluation criteria with semantic understanding
  • Local Reward Functions - Fast, deterministic functions for exact match and simple checks
  • Multi-Provider Support - OpenAI, Anthropic, Gemini, xAI, OpenRouter, and Cerebras (see the sketch after this list)
  • CLI Tool - Batch evaluations with statistics and result tracking
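
A rough sketch of cross-provider comparison using the evaluate_rubric helper shown earlier; only the "openai"/"gpt-5" pair appears on this page, so the other provider and model identifiers below are assumptions:

from osmosis_ai import evaluate_rubric

response = "You can reset your password by clicking 'Forgot Password'."

# Provider/model pairs other than "openai"/"gpt-5" are assumed identifiers.
candidates = [
    {"provider": "openai", "model": "gpt-5"},
    {"provider": "anthropic", "model": "claude-sonnet-4-5"},
]

for model_info in candidates:
    score = evaluate_rubric(
        rubric="Evaluate how helpful and clear the response is.",
        solution_str=response,
        model_info=model_info,
    )
    print(model_info["provider"], score)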

Use Cases

  • Quality Assurance - Evaluate LLM responses before serving to users
  • Model Comparison - Compare outputs across models and providers
  • Reinforcement Learning - Create reward functions for training (see the blended-reward sketch after this list)
  • A/B Testing - Measure impact of prompt variations
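
For the reinforcement learning use case, one option is to blend a fast local check with an LLM-judged rubric score; this sketch reuses exact_match and helpfulness_check from the examples above, and the 50/50 weighting is illustrative:

# Assumes exact_match and helpfulness_check from the examples above are in scope.
def blended_reward(solution_str: str, ground_truth: str) -> float:
    # Illustrative weighting; tune it for your training setup.
    exact = exact_match(solution_str=solution_str, ground_truth=ground_truth)
    helpful = helpfulness_check(solution_str=solution_str, ground_truth=ground_truth, extra_info={})
    return 0.5 * exact + 0.5 * helpful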

Next Steps