Building Graders

The Grader class defines how your agent’s outputs are evaluated and scored. It produces the reward signal that drives reinforcement learning — higher rewards for better outputs, lower rewards for worse ones.

Grader Base Class

from osmosis_ai.rollout import Grader, GraderContext

class MyGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            # Evaluate sample and assign reward
            ctx.set_sample_reward(sample_id, 1.0)

The base class signature from the SDK:

class Grader(ABC):
    def __init__(self, config: GraderConfig | None = None):
        self.config = config

    @abstractmethod
    async def grade(self, ctx: GraderContext) -> Any:
        raise NotImplementedError

Like AgentWorkflow, the Grader has one abstract method — grade() — which receives a GraderContext containing the agent’s outputs and the reference answer for the current dataset row.

GraderContext

The ctx parameter passed to grade() provides:

| Field | Type | Description |
| --- | --- | --- |
| `ctx.label` | `str \| None` | Reference answer for the current dataset row (typically your `ground_truth` column) |
| `ctx.samples` | `dict[str, RolloutSample]` | Agent outputs keyed by sample ID |
| `ctx.workspace_path` | `str \| None` | Path to the workspace root directory |
| `ctx.set_sample_reward(sample_id, reward)` | method | Assign a float reward to a sample |

ctx.samples is a dictionary because the training cluster may run multiple rollouts per prompt (controlled by the num_samples training parameter). Each sample represents one independent execution of your AgentWorkflow for the same prompt.

`set_sample_reward`

Call ctx.set_sample_reward(sample_id, reward) to assign a reward to each sample. The reward should be a float — typically between 0.0 and 1.0, but any float value is accepted.

ctx.set_sample_reward(sample_id, 0.85)

set_sample_reward raises a ValueError if the sample_id is not found in ctx.samples. Always iterate over ctx.samples.items() to ensure you use valid sample IDs.
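
To exercise a grader without a training run, a small stand-in for the context can help. The sketch below is not part of the SDK: `FakeSample`, `FakeContext`, and `grade_all_ones` are hypothetical names that mimic only the fields described above, including the `ValueError` on an unknown sample ID.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeSample:
    """Hypothetical stand-in for RolloutSample (not an SDK class)."""
    id: str
    messages: list
    reward: Optional[float] = None


@dataclass
class FakeContext:
    """Hypothetical stand-in for GraderContext (not an SDK class)."""
    label: Optional[str]
    samples: dict = field(default_factory=dict)

    def set_sample_reward(self, sample_id: str, reward: float) -> None:
        # Mirror the documented behavior: unknown IDs are rejected.
        if sample_id not in self.samples:
            raise ValueError(f"unknown sample_id: {sample_id}")
        self.samples[sample_id].reward = float(reward)


async def grade_all_ones(ctx) -> None:
    """Same loop shape as MyGrader.grade: one reward per sample."""
    for sample_id in ctx.samples:
        ctx.set_sample_reward(sample_id, 1.0)


ctx = FakeContext(
    label="42",
    samples={"s1": FakeSample("s1", []), "s2": FakeSample("s2", [])},
)
asyncio.run(grade_all_ones(ctx))
```

Because the stand-in only duck-types the context, the body of a real `grade()` method can be pointed at it in a unit test, provided the grader touches nothing beyond these fields.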

RolloutSample

Each entry in ctx.samples is a RolloutSample object containing the AgentWorkflow’s output:

class RolloutSample(BaseModel):
    id: str                              # Unique sample identifier
    messages: list[MessageDict]          # Conversation messages produced by the agent
    reward: float | None = None          # Reward (set by Grader)

The messages list is the conversation your workflow produced for that sample. In many graders, you only need to extract the final answer text from the last assistant message.

For a real-world reference, see examples/rollout/multiply_rollout/grader.py in the osmosis-sdk-python repository — it shows the canonical pattern for extracting text out of a sample’s last message.
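
As a rough illustration of pulling the final answer from the last assistant message, here is a minimal sketch. It assumes messages are plain dicts with `role` and `content` keys, and that list-style content holds blocks with a `text` key. `last_assistant_text` is an illustrative name, not an SDK function, and the repository example may differ.

```python
def last_assistant_text(messages: list) -> str:
    """Return the text of the most recent assistant message, or "" if none."""
    for message in reversed(messages):
        if message.get("role") != "assistant":
            continue
        content = message.get("content", "")
        if isinstance(content, str):
            return content
        if isinstance(content, list):
            # Join every text block; skip non-text blocks such as tool calls.
            return "".join(
                block["text"]
                for block in content
                if isinstance(block, dict) and "text" in block
            )
        return ""
    return ""


messages = [
    {"role": "user", "content": "What is 6 * 7?"},
    {"role": "assistant", "content": [{"text": "6 * 7 = "}, {"text": "42"}]},
    {"role": "user", "content": "thanks"},
]
print(last_assistant_text(messages))  # 6 * 7 = 42
```

Searching backward by role means a trailing user or tool message does not hide the agent's answer.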

Implementation Patterns

Exact Match Grading

The simplest grading strategy is to compare the agent’s final text against ctx.label. The helper below extracts text from the last message:

from osmosis_ai.rollout import Grader, GraderContext


def _last_text(sample) -> str:
    """Extract text from a sample's last message (the first text block if content is a list)."""
    if not sample.messages:
        return ""
    content = sample.messages[-1].get("content", "")
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return next((b["text"] for b in content if isinstance(b, dict) and "text" in b), "")
    return ""


class ExactMatchGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            answer = _last_text(sample).strip()
            reward = 1.0 if ctx.label and answer == ctx.label.strip() else 0.0
            ctx.set_sample_reward(sample_id, reward)
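
Strict string equality rejects answers that differ only in case, spacing, or number formatting. The sketch below shows one possible, more lenient comparison. `normalized_match` and its `rel_tol` parameter are hypothetical and not part of the SDK, and whether this leniency is appropriate depends on your dataset.

```python
import math


def normalized_match(answer: str, label: str, rel_tol: float = 1e-9) -> bool:
    """Compare two answers, ignoring case and whitespace differences.

    If both sides parse as numbers, compare them numerically instead,
    so "42.0" matches "42".
    """
    try:
        return math.isclose(float(answer), float(label), rel_tol=rel_tol)
    except ValueError:
        pass
    # Collapse runs of whitespace and lowercase before comparing text.
    return " ".join(answer.split()).lower() == " ".join(label.split()).lower()
```

Inside `grade()`, the equality check could then become `1.0 if ctx.label and normalized_match(answer, ctx.label) else 0.0`.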

LLM-as-Judge Grading

Use a separate LLM to evaluate the quality of agent outputs — useful when correctness is subjective or hard to check programmatically. Unlike the workflow, a grader runs off the training path, so you can call any LLM directly:

import litellm
from osmosis_ai.rollout import Grader, GraderContext


class LLMJudgeGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            # _last_text is the helper defined in the Exact Match Grading example
            agent_output = _last_text(sample)
            judge_response = await litellm.acompletion(
                model="openai/gpt-5.2",
                messages=[{
                    "role": "user",
                    "content": f"Rate this response from 0.0 to 1.0.\n\n"
                               f"Expected: {ctx.label}\n"
                               f"Actual: {agent_output}\n\n"
                               f"Score (just the number):"
                }],
            )
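            raw = judge_response.choices[0].message.content or ""
            try:
                score = float(raw.strip())
            except ValueError:
                # The judge did not reply with a bare number; fall back to zero reward
                score = 0.0
            ctx.set_sample_reward(sample_id, max(0.0, min(1.0, score)))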
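
Judge models do not always reply with a bare number, and `float()` raises `ValueError` on anything else. A more tolerant parser might look like the sketch below. `parse_score` is a hypothetical helper, and falling back to `0.0` for an unparseable reply is an assumption that you may want to replace with a retry.

```python
import re

# Matches the first signed integer or decimal in a string.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")


def parse_score(text, default: float = 0.0) -> float:
    """Pull the first number out of a judge reply and clamp it to [0.0, 1.0]."""
    match = _NUMBER.search(text or "")
    if match is None:
        return default
    return max(0.0, min(1.0, float(match.group())))
```

Note that this takes the first number it finds, so a reply such as "7/10" would clamp to 1.0. Keeping the prompt strict about the output format is still worthwhile.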

Tool-Call Based Grading

Evaluate whether the agent made any tool calls, rather than just checking the final text output. Strands records tool invocations as toolUse content blocks on assistant messages:

from osmosis_ai.rollout import Grader, GraderContext


class ToolCallGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            used_tool = False
            for m in sample.messages:
                if m.get("role") != "assistant":
                    continue
                content = m.get("content") or []
                if isinstance(content, list) and any(
                    isinstance(b, dict) and "toolUse" in b for b in content
                ):
                    used_tool = True
                    break
            ctx.set_sample_reward(sample_id, 1.0 if used_tool else 0.0)
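
To check which tools were called rather than only whether any were, the same traversal can collect tool names. This is a sketch under one assumption not stated above: that each `toolUse` block is a dict carrying the tool's name under a `name` key. `tool_names` is an illustrative helper, not an SDK function.

```python
def tool_names(messages: list) -> list:
    """List tool names from toolUse blocks on assistant messages, in call order."""
    names = []
    for message in messages:
        if message.get("role") != "assistant":
            continue
        content = message.get("content") or []
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and isinstance(block.get("toolUse"), dict):
                # Assumption: each toolUse block carries the tool's name under "name".
                name = block["toolUse"].get("name")
                if name is not None:
                    names.append(name)
    return names


example_messages = [
    {"role": "user", "content": [{"text": "What is 6 * 7?"}]},
    {"role": "assistant", "content": [{"toolUse": {"name": "multiply", "input": {"a": 6, "b": 7}}}]},
    {"role": "assistant", "content": [{"text": "42"}]},
]
print(tool_names(example_messages))  # ['multiply']
```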

You can combine multiple grading strategies — for example, check that the agent used the right tools and produced a correct final answer, then weight the scores together.
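
One simple way to weight two checks together is a normalized weighted average. The function name and the default weights below are illustrative assumptions, not SDK defaults.

```python
def combined_reward(
    used_tool: bool,
    answer_correct: bool,
    tool_weight: float = 0.3,
    answer_weight: float = 0.7,
) -> float:
    """Weighted average of two binary checks, normalized to [0.0, 1.0]."""
    total = tool_weight + answer_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    score = tool_weight * float(used_tool) + answer_weight * float(answer_correct)
    return score / total
```

With these weights, a correct answer reached without the tool scores 0.7, while tool use with a wrong answer scores 0.3, so the final answer dominates the signal.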

GraderConfig

Custom grader configs follow the same pattern as AgentWorkflowConfig — extend GraderConfig to add custom fields:

from osmosis_ai.rollout import Grader, GraderConfig, GraderContext

class MyGraderConfig(GraderConfig):
    name: str = "my-grader"
    partial_credit: bool = True
    similarity_threshold: float = 0.8

class MyGrader(Grader):
    def __init__(self):
        super().__init__(config=MyGraderConfig())

    async def grade(self, ctx: GraderContext) -> None:
        threshold = self.config.similarity_threshold if self.config else 0.8
        # ... use config values in grading logic ...
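
The source does not define how `similarity_threshold` and `partial_credit` should be used, so the sketch below is only one plausible reading: full reward at or above the threshold, and the raw similarity ratio below it when partial credit is enabled. `similarity_reward` is a hypothetical helper built on the standard library's `difflib`.

```python
from difflib import SequenceMatcher


def similarity_reward(
    answer: str,
    label: str,
    threshold: float = 0.8,
    partial_credit: bool = True,
) -> float:
    """Score an answer by its string similarity to the label.

    At or above the threshold the answer earns full reward. Below it, the
    answer earns the raw similarity ratio if partial credit is on, else 0.0.
    """
    ratio = SequenceMatcher(None, answer.strip(), label.strip()).ratio()
    if ratio >= threshold:
        return 1.0
    return ratio if partial_credit else 0.0
```

Inside `grade()`, the two config fields would be passed through as `threshold` and `partial_credit`.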

GraderConfig extends BaseConfig and includes the same concurrency field as AgentWorkflowConfig:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | `str` | (required) | Identifier for the grader |
| `description` | `str \| None` | `None` | Optional description |
| `concurrency` | `ConcurrencyConfig` | unlimited | Controls max concurrent grading operations |
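
For intuition about what a cap on concurrent grading operations means, here is a general `asyncio` illustration. It does not show how the SDK implements `ConcurrencyConfig`, whose fields are not covered here. `run_bounded` and `fake_grade` are hypothetical names.

```python
import asyncio


async def run_bounded(jobs, limit: int):
    """Run zero-argument coroutine functions with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def _demo():
    running = 0
    peak = 0

    async def fake_grade():
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stands in for a slow judge call
        running -= 1
        return 1.0

    rewards = await run_bounded([fake_grade] * 5, limit=2)
    return rewards, peak


rewards, peak = asyncio.run(_demo())
```

Here five grading jobs are submitted, but the semaphore keeps no more than two in flight, which is the kind of bound that matters when each job calls a rate-limited judge model.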

Auto-Discovery

Like AgentWorkflow, the SDK automatically discovers your Grader subclass from the entrypoint module. No registration needed.

A Grader is optional — your entrypoint can define zero or one Grader subclass. If no Grader is defined, all samples receive a default reward. This can be useful during early development when you want to focus on agent behavior before adding evaluation logic.
