# Documentation Index
Fetch the complete documentation index at: https://docs.osmosis.ai/llms.txt
Use this file to discover all available pages before exploring further.
# Grader

The Grader class defines how your agent's outputs are evaluated and scored. It produces the reward signal that drives reinforcement learning: higher rewards for better outputs, lower rewards for worse ones.
## Grader Base Class
Like AgentWorkflow, the Grader base class has one abstract method, grade(), which receives a GraderContext containing the agent's outputs and the reference answer for the current dataset row.
## GraderContext
The ctx parameter passed to grade() provides:
| Field | Type | Description |
|---|---|---|
| `ctx.label` | `str \| None` | Reference answer for the current dataset row (typically your `ground_truth` column) |
| `ctx.samples` | `dict[str, RolloutSample]` | Agent outputs keyed by sample ID |
| `ctx.workspace_path` | `str \| None` | Path to the workspace root directory |
| `ctx.set_sample_reward(sample_id, reward)` | method | Assign a float reward to a sample |
ctx.samples is a dictionary because the training cluster may run multiple rollouts per prompt (controlled by the num_samples training parameter). Each sample represents one independent execution of your AgentWorkflow for the same prompt.

### set_sample_reward
Call ctx.set_sample_reward(sample_id, reward) to assign a reward to each sample. The reward should be a float — typically between 0.0 and 1.0, but any float value is accepted.
## RolloutSample
Each entry in ctx.samples is a RolloutSample object containing the AgentWorkflow's output.
The messages list is the conversation your workflow produced for that sample. In many graders, you only need to extract the final answer text from the last assistant message.
## Implementation Patterns
### Exact Match Grading
The simplest grading strategy is to compare the agent's final text against ctx.label. The helper below extracts text from the last message:
### LLM-as-Judge Grading
Use a separate LLM to evaluate the quality of agent outputs. This is useful when correctness is subjective or hard to check programmatically. Unlike the workflow, a grader runs off the training path, so you can call any LLM directly.

### Tool-Call Based Grading
Evaluate whether the agent made any tool calls, rather than just checking the final text output. Strands records tool invocations as toolUse content blocks on assistant messages:
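A sketch of detecting tool use, assuming a Bedrock-style message shape in which assistant content is a list of blocks and each tool call appears as a `{"toolUse": {...}}` block; verify the exact shape against your recorded messages:

```python
def made_tool_call(messages: list[dict]) -> bool:
    """True if any assistant message contains a toolUse content block."""
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        content = msg.get("content", [])
        if isinstance(content, list) and any(
            isinstance(block, dict) and "toolUse" in block for block in content
        ):
            return True
    return False
```

A grader built on this could reward 1.0 when the agent used a tool and 0.0 otherwise, or combine it with a text check for partial credit.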
## GraderConfig
Custom grader configs follow the same pattern as AgentWorkflowConfig: extend GraderConfig to add custom fields.
GraderConfig extends BaseConfig and includes the same concurrency field as AgentWorkflowConfig:
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | (required) | Identifier for the grader |
| `description` | `str \| None` | `None` | Optional description |
| `concurrency` | `ConcurrencyConfig` | unlimited | Controls max concurrent grading operations |
## Auto-Discovery
As with AgentWorkflow, the SDK automatically discovers your Grader subclass from the entrypoint module; no registration is needed.
A Grader is optional: your entrypoint can define zero or one Grader subclass. If no Grader is defined, all samples receive a default reward. This can be useful during early development when you want to focus on agent behavior before adding evaluation logic.

## Next Steps
- **Local Evaluation**: Test your AgentWorkflow and Grader locally with eval mode.
- **Execution Backends**: Run rollouts locally or on remote infrastructure.