# Evaluation

> Submit evaluation runs and inspect results from your workspace directory

An evaluation run executes your rollout against a platform dataset, using the same workspace, rollout, entrypoint, dataset, and optional `commit_sha` semantics as [`osmosis train submit`](/cli/command-reference#train-submit). The platform clones the repository identified by the workspace directory's `origin` remote and executes the rollout server-side, so push your changes and confirm that [Git Sync](/cli/workspace/git-sync) has picked them up before submitting.

Evaluation configs must live under `configs/eval/` inside a structured Osmosis workspace directory.

<Note>
  `osmosis eval submit` is also the recommended pre-flight check before a training run. Run it first to catch problems before committing GPU time.
</Note>

## Quick Start

From inside your workspace directory:

```bash
osmosis dataset list                                # confirm the platform dataset name
git push                                            # make sure the platform sees your commit
osmosis eval submit configs/eval/my-rollout.toml
```

Then inspect or manage the run:

```bash
osmosis eval list
osmosis eval info <name>
osmosis eval stop <name>
```

## Evaluation Config

See [Config Files](/cli/config-files#eval-config) for the full field reference.

```toml configs/eval/my-rollout.toml
[experiment]
rollout = "my-rollout"                # Rollout directory under rollouts/
entrypoint = "main.py"                # Entrypoint relative to the rollout directory
model_path = "openai/gpt-5-mini"      # LiteLLM-style model name for the evaluation policy
dataset = "my-platform-dataset"       # Platform dataset name from `osmosis dataset list`
# commit_sha =                        # Optional: pin to a specific commit

[evaluation]
# Optional. Omit values to use platform defaults.
# limit = 200
# n = 1
# batch_size = 1
# pass_threshold = 1.0
# agent_workflow_timeout_s = 450
# grader_timeout_s = 150

# [env]
# LOG_LEVEL = "INFO"

[secrets]
# Required for eval configs. Use required = [] only when no secrets are needed.
required = ["OPENAI_API_KEY"]
```

<Note>
  When `[evaluation].limit` is omitted, the platform evaluates a random 10% sample of the dataset (at least one row). Set `limit` to evaluate a fixed number of rows — the first `N` rows of the dataset, in order.
</Note>
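
The row-selection rule in the note above can be sketched in a few lines of Python. This is an illustrative model, not the platform's implementation: `select_rows` is a hypothetical helper, and rounding the 10% sample down is an assumption.

```python
import random


def select_rows(rows, limit=None, seed=None):
    """Pick the rows to evaluate, following the rule in the note above."""
    if limit is not None:
        # Fixed limit: the first N rows, in dataset order.
        return rows[:limit]
    # No limit: a random 10% sample, but never fewer than one row.
    # Rounding down is an assumption; the platform may round differently.
    sample_size = min(len(rows), max(1, len(rows) // 10))
    return random.Random(seed).sample(rows, sample_size)


rows = list(range(50))
print(select_rows(rows, limit=3))  # [0, 1, 2]
print(len(select_rows(rows)))      # 5
```

Because the default sample is random, two runs without `limit` may cover different rows. Setting `limit` is the simpler choice when you want results that are comparable across runs.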

<Warning>
  Git Sync is the source of truth for your rollout code. The CLI reads config values from the local TOML file you pass, but rollout code comes from the synced workspace repository. Commit, push, and wait for sync before submitting code changes; set `commit_sha` when you need a specific synced revision.
</Warning>
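
A small local check can catch the most common mistake here, which is submitting before pushing. The sketch below is one possible approach using standard `git` commands through Python's `subprocess`. The `sync_problems` helper is hypothetical, it assumes the current branch has an upstream tracking branch, and it cannot tell whether the platform's Git Sync has finished.

```python
import subprocess


def git(*args):
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def sync_problems(run_git=git):
    """Return reasons the local checkout may differ from what was pushed."""
    problems = []
    # Any output from `git status --porcelain` means uncommitted changes.
    if run_git("status", "--porcelain"):
        problems.append("uncommitted changes in the working tree")
    # Count commits on HEAD that the upstream tracking branch does not have.
    if run_git("rev-list", "--count", "@{u}..HEAD") != "0":
        problems.append("local commits have not been pushed")
    return problems
```

Calling `sync_problems()` from the workspace directory returns an empty list when the working tree is clean and every local commit is on the upstream branch. When you need to pin a revision, `git rev-parse HEAD` prints the full SHA to use as `commit_sha`.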

## How It Works

<Steps>
  <Step title="Resolve workspace and config">
    The CLI reads the evaluation TOML, resolves the workspace from the Git `origin` remote, and validates the `[experiment]` and `[secrets]` sections (plus optional `[evaluation]` and `[env]`) locally before submitting.
  </Step>

  <Step title="Submit to the platform">
    The CLI submits the evaluation run request. The platform clones the connected workspace repository, checks out the pinned `commit_sha` if one is set, and prepares the evaluation environment.
  </Step>

  <Step title="Validate the model">
    Before evaluating any rows, the platform runs a pre-flight check that confirms `[experiment].model_path` is reachable with your configured credentials. If the model is unreachable (wrong name, missing or invalid API key, or provider rate limiting), the run fails early instead of consuming evaluation resources. Provide the model's provider API key by registering it with [`osmosis secret set`](/cli/command-reference#secret) and listing it under `[secrets].required` (see [Config Files](/cli/config-files#env-and-secrets)).
  </Step>

  <Step title="Run the rollout server-side">
    The platform starts your rollout, drives `AgentWorkflow.run(ctx)` for each selected row of the platform dataset using `[experiment].model_path` as the evaluation policy, then runs `Grader.grade(ctx)` against the row's `ground_truth`.
  </Step>

  <Step title="Aggregate results">
    The platform aggregates rewards, pass rates, and per-row results. Use `osmosis eval info <name>` (or `osmosis --json eval info <name>`) to inspect them.
  </Step>
</Steps>
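
The last two steps (run, grade, aggregate) can be modeled roughly as follows. This is a simplified sketch, not the platform's code: `run_workflow` and `grade` are plain-function stand-ins for `AgentWorkflow.run(ctx)` and `Grader.grade(ctx)`, the `prompt` row field is hypothetical, and treating a row as passed when its reward is at least `pass_threshold` is an assumption. It ignores `n`, `batch_size`, and the timeouts.

```python
from statistics import mean


def evaluate(rows, run_workflow, grade, pass_threshold=1.0):
    """Run and grade each row, then summarize rewards and pass rate."""
    per_row = []
    for row in rows:
        # Stand-in for AgentWorkflow.run(ctx).
        output = run_workflow(row["prompt"])
        # Stand-in for Grader.grade(ctx), scored against ground_truth.
        reward = grade(output, row["ground_truth"])
        per_row.append({"reward": reward, "passed": reward >= pass_threshold})

    return {
        "mean_reward": mean(r["reward"] for r in per_row) if per_row else 0.0,
        "pass_rate": sum(r["passed"] for r in per_row) / len(per_row) if per_row else 0.0,
        "rows": per_row,
    }


# Toy workflow and grader: uppercase the prompt, exact-match the ground truth.
rows = [
    {"prompt": "a", "ground_truth": "A"},
    {"prompt": "b", "ground_truth": "x"},
]
summary = evaluate(rows, str.upper, lambda out, truth: 1.0 if out == truth else 0.0)
print(summary["mean_reward"], summary["pass_rate"])  # 0.5 0.5
```

Under this model, lowering `pass_threshold` raises the pass rate without changing the mean reward, so the two numbers are worth reading together.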

## Commands

| Command                                     | Description                                                                     |
| ------------------------------------------- | ------------------------------------------------------------------------------- |
| `osmosis eval submit <config>.toml [--yes]` | Submit an evaluation run from a TOML under `configs/eval/`.                     |
| `osmosis eval list [--limit N] [--all]`     | List evaluation runs for the current workspace directory.                       |
| `osmosis eval info <name-or-id>`            | Show details and results for a specific evaluation run.                         |
| `osmosis eval stop <name-or-id> [--yes]`    | Stop a pending or running evaluation run.                                       |
| `osmosis eval rubric`                       | Score a JSONL conversation file locally with an LLM judge; no platform access.  |

See the [Command Reference](/cli/command-reference#eval) for the full flag list.

## From Evaluation Run to Training Run

<Steps>
  <Step title="Submit an evaluation run">
    Run `osmosis eval submit configs/eval/my-rollout.toml`. Use `osmosis eval list` and `osmosis eval info <name>` to track progress and inspect results.
  </Step>

  <Step title="Iterate on rollout code">
    Push fixes to the workspace repository and re-submit. To compare changes, set `commit_sha` to re-run the same evaluation config against an older revision.
  </Step>

  <Step title="Submit a training run">
    Once evaluation run results look healthy, run `osmosis train submit configs/training/my-rollout.toml`. See [Training Runs](/platform/training-runs).
  </Step>
</Steps>

## Local Rubric Scoring

`osmosis eval rubric` is a local utility for scoring an existing JSONL conversation file with an LLM judge. It does not require a workspace directory or platform authentication, and it does not run a rollout.

```bash
osmosis eval rubric -d conversations.jsonl \
  --rubric "Evaluate the assistant's helpfulness..." \
  --model openai/gpt-5-mini
```

See the [Command Reference](/cli/command-reference#eval-rubric) for the full flag list.

## Next Steps

<CardGroup cols={2}>
  <Card title="Config Files" icon="file-lines" href="/cli/config-files">
    Full reference for evaluation and training configuration files.
  </Card>

  <Card title="Git Sync" icon="rotate" href="/cli/workspace/git-sync">
    Push and sync rollout code before submitting evaluation runs or training runs.
  </Card>

  <Card title="Training Runs" icon="rocket" href="/platform/training-runs">
    Submit a training run once evaluation run results look good.
  </Card>
</CardGroup>
