Grader class defines how your agent’s outputs are evaluated and scored. It produces the reward signal that drives reinforcement learning — higher rewards for better outputs, lower rewards for worse ones.
Grader Base Class
AgentWorkflow, the Grader has one abstract method — grade() — which receives a GraderContext containing the agent’s outputs and the reference answer for the current dataset row.
GraderContext
Thectx parameter passed to grade() provides:
| Field | Type | Description |
|---|---|---|
ctx.label | str | None | Reference answer for the current dataset row (typically your ground_truth column) |
ctx.metadata | dict[str, Any] | None | Per-row metadata from the dataset’s optional metadata column. None when the row has no metadata. |
ctx.samples | dict[str, RolloutSample] | Agent outputs keyed by sample ID |
ctx.project_path | str | None | Optional project path supplied by the execution harness |
ctx.artifacts | dict[str, Any] | None | Optional output JSON the grader attaches; starts as None |
ctx.set_sample_reward(sample_id, reward) | method | Assign a float reward to a sample |
ctx.set_artifacts(artifacts) | method | Attach an optional output JSON payload (see Artifacts) |
The Grader runs whenever a dataset row has a
label or metadata, so you can drive reward signals from metadata alone (for example, expected tool calls or per-row rubrics).ctx.samples is keyed by sample source ID within a single workflow execution. With the built-in integrations, sample IDs usually come from the Strands agent name or the OpenAI Agents session name. Evaluation runs and training runs can still execute the workflow multiple times for the same prompt ([evaluation].n in evaluation configs, n_samples_per_prompt in training configs); each execution receives its own GraderContext.set_sample_reward
Call ctx.set_sample_reward(sample_id, reward) to assign a reward to each sample. The reward should be a float — typically between 0.0 and 1.0, but any float value is accepted.
Artifacts
ctx.set_artifacts(artifacts) is an optional output channel for returning structured JSON alongside rewards — judge explanations, IDs, or pointers to larger traces the frontend can render. Use it when reward alone doesn’t capture why you scored a sample the way you did. Skip the call and nothing changes on the wire; existing callbacks stay byte-identical.
- Pass a JSON-serializable
dict. Non-serializable values,NaN, orInfinityare rejected. - The payload is capped at 64 KiB after compact UTF‑8 JSON encoding.
- Oversized or invalid payloads degrade to a small
{"_error": {...}}marker so rewards always ship — sanitization never blocks reward delivery. - Don’t embed logs, traces, or binaries. Reference them by
{path|url, content_type, size_bytes}and keep them in object storage instead.
ctx.metadata (input-side, read-only) and ctx.artifacts (output-side, set by you) are separate channels. Treat metadata as the dataset row, and artifacts as what you want the platform to display alongside the reward.RolloutSample
Each entry inctx.samples is a RolloutSample object containing the AgentWorkflow’s output:
messages list is the conversation your workflow produced for that sample. In many graders, you only need to extract the final answer text from the last assistant message.
Implementation Patterns
Exact Match Grading
The simplest grading strategy is to compare the agent’s final text againstctx.label. The helper below extracts text from the last message:
LLM-as-Judge Grading
Use a separate LLM to evaluate the quality of agent outputs — useful when correctness is subjective or hard to check programmatically. Unlike the workflow, a grader runs off the training path, so you can call any LLM directly:Tool-Call Based Grading
Evaluate whether the agent made any tool calls, rather than just checking the final text output. Strands records tool invocations astoolUse content blocks on assistant messages:
GraderConfig
Custom grader configs follow the same pattern asAgentWorkflowConfig — extend GraderConfig and define a module-level config instance in your rollout entrypoint:
LocalBackend(grader_config=my_grader_config). Evaluation and training TOML files do not currently set grader config fields directly.
GraderConfig extends BaseConfig and includes the same concurrency field as AgentWorkflowConfig, but current backends do not use it to limit grader concurrency. Use evaluation [evaluation].batch_size, workflow/backend concurrency, or an explicit limiter inside the grader when your grader calls external services.
| Field | Type | Default | Description |
|---|---|---|---|
name | str | (required) | Identifier for the grader |
description | str | None | None | Optional description |
concurrency | ConcurrencyConfig | unlimited | Present on the config model; not currently enforced by LocalBackend |
Auto-Discovery
LikeAgentWorkflow, osmosis train submit preflight can discover your Grader subclass from the entrypoint module. No registration decorator is needed, but your rollout entrypoint still passes the grader class and optional config to the backend it constructs.
Next Steps
Evaluation
Submit an evaluation run to test your AgentWorkflow and Grader before a training run.