The Grader class defines how your agent’s outputs are evaluated and scored. It produces the reward signal that drives reinforcement learning — higher rewards for better outputs, lower rewards for worse ones.

Grader Base Class

from osmosis_ai.rollout import Grader, GraderContext

class MyGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            # Evaluate sample and assign reward
            ctx.set_sample_reward(sample_id, 1.0)
The base class signature from the SDK:
class Grader(ABC):
    def __init__(self, config: GraderConfig | None = None):
        self.config = config

    @abstractmethod
    async def grade(self, ctx: GraderContext) -> Any:
        raise NotImplementedError
Like AgentWorkflow, the Grader has one abstract method — grade() — which receives a GraderContext containing the agent’s outputs and the reference answer for the current dataset row.

GraderContext

The ctx parameter passed to grade() provides:
| Field | Type | Description |
| --- | --- | --- |
| ctx.label | str \| None | Reference answer for the current dataset row (typically your ground_truth column) |
| ctx.metadata | dict[str, Any] \| None | Per-row metadata from the dataset’s optional metadata column. None when the row has no metadata. |
| ctx.samples | dict[str, RolloutSample] | Agent outputs keyed by sample ID |
| ctx.project_path | str \| None | Optional project path supplied by the execution harness |
| ctx.artifacts | dict[str, Any] \| None | Optional output JSON the grader attaches; starts as None |
| ctx.set_sample_reward(sample_id, reward) | method | Assign a float reward to a sample |
| ctx.set_artifacts(artifacts) | method | Attach an optional output JSON payload (see Artifacts) |
The Grader runs whenever a dataset row has a label or metadata, so you can drive reward signals from metadata alone (for example, expected tool calls or per-row rubrics).
ctx.samples is keyed by sample source ID within a single workflow execution. With the built-in integrations, sample IDs usually come from the Strands agent name or the OpenAI Agents session name. Evaluation runs and training runs can still execute the workflow multiple times for the same prompt ([evaluation].n in evaluation configs, n_samples_per_prompt in training configs); each execution receives its own GraderContext.

set_sample_reward

Call ctx.set_sample_reward(sample_id, reward) to assign a reward to each sample. The reward should be a float — typically between 0.0 and 1.0, but any float value is accepted.
ctx.set_sample_reward(sample_id, 0.85)
set_sample_reward raises a ValueError if the sample_id is not found in ctx.samples. Always iterate over ctx.samples.items() to ensure you use valid sample IDs.

Artifacts

ctx.set_artifacts(artifacts) is an optional output channel for returning structured JSON alongside rewards — judge explanations, IDs, or pointers to larger traces the frontend can render. Use it when reward alone doesn’t capture why you scored a sample the way you did. Skip the call and nothing changes on the wire; existing callbacks stay byte-identical.
class JudgeGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            ctx.set_sample_reward(sample_id, 0.7)
        ctx.set_artifacts({
            "judge": {"explanation": "Missed the final constraint."},
            "trace_ref": {
                "path": "rollout_traces/run_123/sample_456.jsonl",
                "content_type": "application/jsonl",
                "size_bytes": 38291,
            },
        })
Rules to keep in mind:
  • Pass a JSON-serializable dict. Non-serializable values, NaN, or Infinity are rejected.
  • The payload is capped at 64 KiB after compact UTF‑8 JSON encoding.
  • Oversized or invalid payloads degrade to a small {"_error": {...}} marker so rewards always ship — sanitization never blocks reward delivery.
  • Don’t embed logs, traces, or binaries. Reference them by {path|url, content_type, size_bytes} and keep them in object storage instead.
ctx.metadata (input-side, read-only) and ctx.artifacts (output-side, set by you) are separate channels. Treat metadata as the dataset row, and artifacts as what you want the platform to display alongside the reward.
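If you want to catch an oversized or invalid payload before calling ctx.set_artifacts, a local pre-check along these lines may help. This is a sketch, not the SDK's own sanitizer: the helper names are hypothetical, and the exact encoder settings the platform uses for "compact UTF-8 JSON" are an assumption (no whitespace separators, non-ASCII characters kept as-is).

```python
import json
from typing import Any

MAX_ARTIFACT_BYTES = 64 * 1024  # 64 KiB cap from the rules above


def artifact_size_bytes(artifacts: dict[str, Any]) -> int:
    """Return the payload size as compact UTF-8 JSON.

    Raises TypeError for non-serializable values and ValueError for
    NaN or Infinity (because allow_nan=False).
    """
    encoded = json.dumps(
        artifacts,
        separators=(",", ":"),  # assumed meaning of "compact"
        ensure_ascii=False,
        allow_nan=False,
    )
    return len(encoded.encode("utf-8"))


def is_valid_artifact(artifacts: dict[str, Any]) -> bool:
    """True when the payload serializes cleanly and fits under the cap."""
    try:
        return artifact_size_bytes(artifacts) <= MAX_ARTIFACT_BYTES
    except (TypeError, ValueError):
        return False
```

A grader could call is_valid_artifact(payload) and, when it returns False, trim the payload or replace large fields with a path or URL reference before attaching it.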

RolloutSample

Each entry in ctx.samples is a RolloutSample object containing the AgentWorkflow’s output:
from collections.abc import Mapping, Sequence
from typing import Any

from pydantic import BaseModel, Field


class RolloutSample(BaseModel):
    id: str
    messages: Sequence[Mapping[str, Any]] = Field(default_factory=list)
    label: str | None = None
    reward: float | None = None
    remove_sample: bool = False
    metrics: dict[str, Any] = Field(default_factory=dict)
    extra_fields: dict[str, Any] = Field(default_factory=dict)
The messages list is the conversation your workflow produced for that sample. In many graders, you only need to extract the final answer text from the last assistant message.
For real-world references, see rollouts/multiply-local-strands/main.py and rollouts/multiply-local-openai/main.py in the workspace-template repository. Those files are the source of truth for platform-created workspace repositories.
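As an illustration of pulling the final answer out of messages, the sketch below walks the conversation backwards and returns the text of the last assistant message. The function name is hypothetical, and the two content shapes it handles (a plain string, or a list of blocks where text blocks carry a "text" key) are assumptions about what your integration records.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def last_assistant_text(messages: Sequence[Mapping[str, Any]]) -> str:
    """Return the text of the most recent assistant message, or "" if none."""
    for message in reversed(messages):
        if message.get("role") != "assistant":
            continue
        content = message.get("content", "")
        if isinstance(content, str):
            return content
        if isinstance(content, list):
            # Concatenate every text block; non-text blocks are ignored.
            return "".join(
                block["text"]
                for block in content
                if isinstance(block, dict) and isinstance(block.get("text"), str)
            )
        return ""
    return ""
```

Filtering on the assistant role matters when the conversation can end with a non-assistant message, such as a tool result.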

Implementation Patterns

Exact Match Grading

The simplest grading strategy is to compare the agent’s final text against ctx.label. The helper below extracts text from the last message:
from osmosis_ai.rollout import Grader, GraderContext


def _last_text(sample) -> str:
    """Extract the final text block from a sample's last message."""
    if not sample.messages:
        return ""
    content = sample.messages[-1].get("content", "")
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return next((b["text"] for b in content if isinstance(b, dict) and "text" in b), "")
    return ""


class ExactMatchGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            answer = _last_text(sample).strip()
            reward = 1.0 if ctx.label and answer == ctx.label.strip() else 0.0
            ctx.set_sample_reward(sample_id, reward)

LLM-as-Judge Grading

Use a separate LLM to evaluate the quality of agent outputs — useful when correctness is subjective or hard to check programmatically. Unlike the workflow, a grader runs off the training path, so you can call any LLM directly:
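import math

import litellm
from osmosis_ai.rollout import Grader, GraderContext

# Reuses the _last_text helper defined in the Exact Match Grading example.


class LLMJudgeGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            agent_output = _last_text(sample)
            judge_response = await litellm.acompletion(
                model="openai/gpt-5.2",
                messages=[{
                    "role": "user",
                    "content": f"Rate this response from 0.0 to 1.0.\n\n"
                               f"Expected: {ctx.label}\n"
                               f"Actual: {agent_output}\n\n"
                               f"Score (just the number):"
                }],
            )
            raw = (judge_response.choices[0].message.content or "").strip()
            # Judges sometimes reply with prose or nothing; score those as 0.0
            # instead of letting float() raise and abort grading.
            try:
                score = float(raw)
            except ValueError:
                score = 0.0
            # float() also accepts "nan" and "inf"; treat them as unusable.
            if not math.isfinite(score):
                score = 0.0
            ctx.set_sample_reward(sample_id, max(0.0, min(1.0, score)))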

Tool-Call Based Grading

Evaluate whether the agent made any tool calls, rather than just checking the final text output. Strands records tool invocations as toolUse content blocks on assistant messages:
from osmosis_ai.rollout import Grader, GraderContext


class ToolCallGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            used_tool = False
            for m in sample.messages:
                if m.get("role") != "assistant":
                    continue
                content = m.get("content") or []
                if isinstance(content, list) and any(
                    isinstance(b, dict) and "toolUse" in b for b in content
                ):
                    used_tool = True
                    break
            ctx.set_sample_reward(sample_id, 1.0 if used_tool else 0.0)
You can combine multiple grading strategies — for example, check that the agent used the right tools and produced a correct final answer, then weight the scores together.
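One way to weight component scores together is a normalized weighted average, sketched below. The function name and the component names in the usage note are hypothetical; the weights themselves are a design choice for your task.

```python
def combine_rewards(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of component scores.

    Components listed in weights but missing from scores count as 0.0.
    Dividing by the total weight keeps the result in [0, 1] when every
    component score is in [0, 1] and the weights are non-negative.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(weights[name] * scores.get(name, 0.0) for name in weights)
    return weighted / total_weight
```

For example, combine_rewards({"tool_use": 1.0, "exact_match": 0.0}, {"tool_use": 1.0, "exact_match": 3.0}) gives the correct final answer three times the weight of tool usage, so a sample that called a tool but answered incorrectly earns a quarter of the maximum reward.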

GraderConfig

Custom grader configs follow the same pattern as AgentWorkflowConfig — extend GraderConfig and define a module-level config instance in your rollout entrypoint:
from osmosis_ai.rollout import Grader, GraderConfig, GraderContext

class MyGraderConfig(GraderConfig):
    name: str = "my-grader"
    partial_credit: bool = True
    similarity_threshold: float = 0.8

class MyGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        threshold = self.config.similarity_threshold if self.config else 0.8
        # ... use config values in grading logic ...

my_grader_config = MyGraderConfig()
Pass the config instance to LocalBackend(grader_config=my_grader_config). Evaluation and training TOML files do not currently set grader config fields directly. GraderConfig extends BaseConfig and includes the same concurrency field as AgentWorkflowConfig, but current backends do not use it to limit grader concurrency. When your grader calls external services, limit load with [evaluation].batch_size in the evaluation config, with workflow or backend concurrency settings, or with an explicit limiter inside the grader.
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | (required) | Identifier for the grader |
| description | str \| None | None | Optional description |
| concurrency | ConcurrencyConfig | unlimited | Present on the config model; not currently enforced by LocalBackend |
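Because the concurrency field is not enforced, an explicit limiter inside the grader is one way to protect an external service. The sketch below uses asyncio.Semaphore and is independent of the SDK; score_with_limit, score_one, and max_concurrent are hypothetical names, and score_one stands in for whatever async call your grader makes per sample.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable


async def score_with_limit(
    sample_ids: Iterable[str],
    score_one: Callable[[str], Awaitable[float]],
    max_concurrent: int = 4,
) -> dict[str, float]:
    """Score every sample, running at most max_concurrent calls at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(sample_id: str) -> tuple[str, float]:
        # Only max_concurrent coroutines can be inside this block at once.
        async with semaphore:
            return sample_id, await score_one(sample_id)

    results = await asyncio.gather(*(guarded(sid) for sid in sample_ids))
    return dict(results)
```

Inside grade(), you could await score_with_limit(ctx.samples.keys(), judge) and then pass each result to ctx.set_sample_reward. Note that a semaphore created per call only limits work within that one grade() invocation; if several invocations may run at the same time, sharing one semaphore across them (for example, as an attribute on the grader instance) would be needed to cap total load.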

Auto-Discovery

As with AgentWorkflow, the osmosis train submit preflight check can discover your Grader subclass from the entrypoint module. No registration decorator is needed, but your rollout entrypoint still passes the grader class and optional config to the backend it constructs.
osmosis train submit requires a concrete Grader in the rollout entrypoint. If the SDK finds no Grader, preflight validation fails instead of assigning a default reward.

Next Steps

Evaluation

Submit an evaluation run to test your AgentWorkflow and Grader before a training run.