# Building Graders

> Implement the Grader class to define reward signals for training

The `Grader` class defines how your agent's outputs are evaluated and scored. It produces the reward signal that drives reinforcement learning — higher rewards for better outputs, lower rewards for worse ones.

## Grader Base Class

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
from osmosis_ai.rollout import Grader, GraderContext

class MyGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            # Evaluate sample and assign reward
            ctx.set_sample_reward(sample_id, 1.0)
```

The base class signature from the SDK:

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
class Grader(ABC):
    def __init__(self, config: GraderConfig | None = None):
        self.config = config

    @abstractmethod
    async def grade(self, ctx: GraderContext) -> Any:
        raise NotImplementedError
```

Like `AgentWorkflow`, the `Grader` has one abstract method — `grade()` — which receives a `GraderContext` containing the agent's outputs and the reference answer for the current dataset row.

## GraderContext

The `ctx` parameter passed to `grade()` provides:

| Field                                      | Type                       | Description                                                                                          |
| ------------------------------------------ | -------------------------- | ---------------------------------------------------------------------------------------------------- |
| `ctx.label`                                | `str \| None`              | Reference answer for the current dataset row (typically your `ground_truth` column)                  |
| `ctx.metadata`                             | `dict[str, Any] \| None`   | Per-row metadata from the dataset's optional `metadata` column. `None` when the row has no metadata. |
| `ctx.samples`                              | `dict[str, RolloutSample]` | Agent outputs keyed by sample ID                                                                     |
| `ctx.project_path`                         | `str \| None`              | Optional project path supplied by the execution harness                                              |
| `ctx.artifacts`                            | `dict[str, Any] \| None`   | Optional output JSON the grader attaches; starts as `None`                                           |
| `ctx.set_sample_reward(sample_id, reward)` | method                     | Assign a float reward to a sample                                                                    |
| `ctx.set_artifacts(artifacts)`             | method                     | Attach an optional output JSON payload (see [Artifacts](#artifacts))                                 |

<Note>
  The Grader runs whenever a dataset row has a `label` **or** `metadata`, so you can drive reward signals from metadata alone (for example, expected tool calls or per-row rubrics).
</Note>

<Note>
  `ctx.samples` is keyed by sample source ID within a single workflow execution. With the built-in integrations, sample IDs usually come from the Strands agent name or the OpenAI Agents session name. Evaluation runs and training runs can still execute the workflow multiple times for the same prompt (`[evaluation].n` in evaluation configs, `n_samples_per_prompt` in training configs); each execution receives its own `GraderContext`.
</Note>

### `set_sample_reward`

Call `ctx.set_sample_reward(sample_id, reward)` to assign a reward to each sample. The reward should be a float — typically between 0.0 and 1.0, but any float value is accepted.

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
ctx.set_sample_reward(sample_id, 0.85)
```

<Warning>
  `set_sample_reward` raises a `ValueError` if the `sample_id` is not found in `ctx.samples`. Always iterate over `ctx.samples.items()` to ensure you use valid sample IDs.
</Warning>

## Artifacts

`ctx.set_artifacts(artifacts)` is an optional output channel for returning structured JSON alongside rewards, such as judge explanations, IDs, or pointers to larger traces the frontend can render. Use it when the reward alone doesn't capture why you scored a sample the way you did. If you skip the call, nothing changes on the wire and existing callbacks stay byte-identical.

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
class JudgeGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            ctx.set_sample_reward(sample_id, 0.7)
        ctx.set_artifacts({
            "judge": {"explanation": "Missed the final constraint."},
            "trace_ref": {
                "path": "rollout_traces/run_123/sample_456.jsonl",
                "content_type": "application/jsonl",
                "size_bytes": 38291,
            },
        })
```

Rules to keep in mind:

* Pass a JSON-serializable `dict`. Non-serializable values, `NaN`, or `Infinity` are rejected.
* The payload is capped at **64 KiB** after compact UTF‑8 JSON encoding.
* Oversized or invalid payloads degrade to a small `{"_error": {...}}` marker so rewards always ship — sanitization never blocks reward delivery.
* Don't embed logs, traces, or binaries. Reference them by `{path|url, content_type, size_bytes}` and keep them in object storage instead.

<Note>
  `ctx.metadata` (input-side, read-only) and `ctx.artifacts` (output-side, set by you) are separate channels. Treat `metadata` as the dataset row, and `artifacts` as what you want the platform to display alongside the reward.
</Note>
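
To catch oversized or non-serializable payloads before they degrade to an `_error` marker, you could run a local pre-check along these lines. This is a sketch under assumptions: `artifacts_fit` is a hypothetical helper, and the SDK's own encoder may differ in details, so treat the result as an estimate and not as the platform's exact verdict.

```python
import json
from typing import Any

MAX_ARTIFACT_BYTES = 64 * 1024  # 64 KiB cap from the artifact rules


def artifacts_fit(payload: dict[str, Any]) -> bool:
    """Return True if payload encodes to compact UTF-8 JSON within the cap."""
    try:
        encoded = json.dumps(
            payload,
            separators=(",", ":"),  # Compact form: no spaces after "," or ":"
            ensure_ascii=False,     # Emit non-ASCII as UTF-8, not \u escapes
            allow_nan=False,        # Raise ValueError on NaN / Infinity
        ).encode("utf-8")
    except (TypeError, ValueError):
        return False  # Not JSON-serializable, or contains NaN / Infinity
    return len(encoded) <= MAX_ARTIFACT_BYTES
```

If the check fails, one option is to move the bulky part to object storage and attach only a reference to it.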

## RolloutSample

Each entry in `ctx.samples` is a `RolloutSample` object containing the AgentWorkflow's output:

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
from collections.abc import Mapping, Sequence
from typing import Any

from pydantic import BaseModel, Field


class RolloutSample(BaseModel):
    id: str
    messages: Sequence[Mapping[str, Any]] = Field(default_factory=list)
    label: str | None = None
    reward: float | None = None
    remove_sample: bool = False
    metrics: dict[str, Any] = Field(default_factory=dict)
    extra_fields: dict[str, Any] = Field(default_factory=dict)
```

The `messages` list is the conversation your workflow produced for that sample. In many graders, you only need to extract the final answer text from the last assistant message.
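
One way to pull out that final answer is sketched below. It assumes messages are dicts with a `role` and a `content` that is either a string or a list of blocks carrying a `text` key; the exact shape depends on your agent integration. `final_assistant_text` is a hypothetical helper name.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def final_assistant_text(messages: Sequence[Mapping[str, Any]]) -> str:
    """Return the text of the most recent assistant message, or "" if none."""
    for message in reversed(messages):
        if message.get("role") != "assistant":
            continue
        content = message.get("content", "")
        if isinstance(content, str):
            return content
        if isinstance(content, list):
            # Join every text block; skip tool-use and other non-text blocks
            parts = [
                block["text"]
                for block in content
                if isinstance(block, dict) and isinstance(block.get("text"), str)
            ]
            return "".join(parts)
        return ""
    return ""
```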

<Tip>
  For real-world references, see `rollouts/multiply-local-strands/main.py` and `rollouts/multiply-local-openai/main.py` in the `workspace-template` repository. Those files are the source of truth for platform-created workspace repositories.
</Tip>

## Implementation Patterns

### Exact Match Grading

The simplest grading strategy is to compare the agent's final text against `ctx.label`. The helper below extracts text from the last message:

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
from osmosis_ai.rollout import Grader, GraderContext


def _last_text(sample) -> str:
    """Return the first text block from a sample's last message."""
    if not sample.messages:
        return ""
    content = sample.messages[-1].get("content", "")
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return next((b["text"] for b in content if isinstance(b, dict) and "text" in b), "")
    return ""


class ExactMatchGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            answer = _last_text(sample).strip()
            reward = 1.0 if ctx.label and answer == ctx.label.strip() else 0.0
            ctx.set_sample_reward(sample_id, reward)
```

### LLM-as-Judge Grading

Use a separate LLM to evaluate the quality of agent outputs — useful when correctness is subjective or hard to check programmatically. Unlike the workflow, a grader runs off the training path, so you can call any LLM directly:

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
import litellm
from osmosis_ai.rollout import Grader, GraderContext


class LLMJudgeGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            # _last_text is the helper defined in the Exact Match Grading example
            agent_output = _last_text(sample)
            judge_response = await litellm.acompletion(
                model="openai/gpt-5.2",
                messages=[{
                    "role": "user",
                    "content": f"Rate this response from 0.0 to 1.0.\n\n"
                               f"Expected: {ctx.label}\n"
                               f"Actual: {agent_output}\n\n"
                               f"Score (just the number):"
                }],
            )
            raw = judge_response.choices[0].message.content or ""
            try:
                score = float(raw.strip())
            except ValueError:
                score = 0.0  # Judge did not reply with a bare number
            ctx.set_sample_reward(sample_id, max(0.0, min(1.0, score)))
```

### Tool-Call Based Grading

Evaluate whether the agent made any tool calls, rather than just checking the final text output. Strands records tool invocations as `toolUse` content blocks on assistant messages:

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
from osmosis_ai.rollout import Grader, GraderContext


class ToolCallGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        for sample_id, sample in ctx.samples.items():
            used_tool = False
            for m in sample.messages:
                if m.get("role") != "assistant":
                    continue
                content = m.get("content") or []
                if isinstance(content, list) and any(
                    isinstance(b, dict) and "toolUse" in b for b in content
                ):
                    used_tool = True
                    break
            ctx.set_sample_reward(sample_id, 1.0 if used_tool else 0.0)
```
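
To check for specific tools instead of any tool, you can collect the tool names from those blocks. This sketch assumes each `toolUse` block is a dict carrying a `name` key; verify that against the messages your workflow actually emits. `called_tools` and `tool_coverage` are hypothetical helpers.

```python
from collections.abc import Iterable, Mapping, Sequence
from typing import Any


def called_tools(messages: Sequence[Mapping[str, Any]]) -> set[str]:
    """Collect the names of tools invoked in assistant messages."""
    names: set[str] = set()
    for message in messages:
        if message.get("role") != "assistant":
            continue
        content = message.get("content") or []
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and isinstance(block.get("toolUse"), dict):
                name = block["toolUse"].get("name")  # Assumed key name
                if isinstance(name, str):
                    names.add(name)
    return names


def tool_coverage(messages: Sequence[Mapping[str, Any]], expected: Iterable[str]) -> float:
    """Return the fraction of expected tools that were actually called."""
    expected_set = set(expected)
    if not expected_set:
        return 0.0
    return len(expected_set & called_tools(messages)) / len(expected_set)
```

The expected tool names could come from `ctx.metadata` if your dataset stores them per row.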

<Tip>
  You can combine multiple grading strategies — for example, check that the agent used the right tools **and** produced a correct final answer, then weight the scores together.
</Tip>
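
A weighted blend of component scores could be sketched as follows. The component names and weights are placeholders, and `combine_scores` is a hypothetical helper, not an SDK function.

```python
def combine_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted average of component scores."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Components missing from `scores` count as 0.0
    weighted = sum(weights[name] * scores.get(name, 0.0) for name in weights)
    return weighted / total_weight
```

For example, `combine_scores({"tools": 1.0, "answer": 0.0}, {"tools": 1, "answer": 3})` weights the final answer three times as heavily as tool use.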

## GraderConfig

Custom grader configs follow the same pattern as `AgentWorkflowConfig` — extend `GraderConfig` and define a module-level config instance in your rollout entrypoint:

```python theme={"theme":{"light":"github-light","dark":"github-dark"},"languages":{"custom":["/languages/cli.json"]}}
from osmosis_ai.rollout import Grader, GraderConfig, GraderContext

class MyGraderConfig(GraderConfig):
    name: str = "my-grader"
    partial_credit: bool = True
    similarity_threshold: float = 0.8

class MyGrader(Grader):
    async def grade(self, ctx: GraderContext) -> None:
        threshold = self.config.similarity_threshold if self.config else 0.8
        # ... use config values in grading logic ...

my_grader_config = MyGraderConfig()
```

Pass the config instance to `LocalBackend(grader_config=my_grader_config)`. Evaluation and training TOML files do not currently set grader config fields directly.

`GraderConfig` extends `BaseConfig` and includes the same `concurrency` field as `AgentWorkflowConfig`, but current backends do not use it to limit grader concurrency. When your grader calls external services, limit load with `[evaluation].batch_size` in the evaluation config, workflow/backend concurrency settings, or an explicit limiter inside the grader.

| Field         | Type                | Default    | Description                                                           |
| ------------- | ------------------- | ---------- | --------------------------------------------------------------------- |
| `name`        | `str`               | (required) | Identifier for the grader                                             |
| `description` | `str \| None`       | `None`     | Optional description                                                  |
| `concurrency` | `ConcurrencyConfig` | unlimited  | Present on the config model; not currently enforced by `LocalBackend` |
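
Since `concurrency` is not enforced for graders, an explicit limiter can live in the grader itself. The sketch below uses `asyncio.Semaphore` to cap in-flight calls; `score_all`, `judge`, and the default limit of 4 are hypothetical names and values, with `judge` standing in for whatever external call your grader makes.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def score_all(
    outputs: dict[str, str],
    judge: Callable[[str], Awaitable[float]],
    max_in_flight: int = 4,
) -> dict[str, float]:
    """Score every output concurrently, with at most max_in_flight judge calls at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def score_one(text: str) -> float:
        async with semaphore:  # Waits here while max_in_flight calls are running
            return await judge(text)

    sample_ids = list(outputs)
    scores = await asyncio.gather(*(score_one(outputs[sid]) for sid in sample_ids))
    return dict(zip(sample_ids, scores))
```

In `grade()`, you would build `outputs` from `ctx.samples` and then call `ctx.set_sample_reward` for each returned score.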

## Auto-Discovery

As with `AgentWorkflow`, the `osmosis train submit` preflight can discover your `Grader` subclass from the entrypoint module. No registration decorator is needed, but your rollout entrypoint still needs to pass the grader class and optional config to the backend it constructs.

<Warning>
  `osmosis train submit` requires a concrete `Grader` in the rollout entrypoint. If the SDK finds no `Grader`, preflight validation fails instead of assigning a default reward.
</Warning>

## Next Steps

<CardGroup cols={2}>
  <Card title="Evaluation" icon="flask-vial" href="/cli/rollout/eval">
    Submit an evaluation run to test your AgentWorkflow and Grader before a training run.
  </Card>
</CardGroup>
