Datasets

Datasets provide the prompts and optional reference answers that drive evaluation runs and training runs. Each row becomes an example that your rollout and Grader process.

Dataset Format

Osmosis accepts datasets in JSONL, CSV, or Parquet format, up to 5 GB per file. Each dataset must contain at least 4 rows.

Required Columns

Column	Description
`system_prompt`	The system prompt provided to the model for this example.
`user_prompt`	The user prompt or question the model must respond to.

Optional Columns

Column	Description
`ground_truth`	The expected correct answer or reference output. The platform UI also accepts `label` as an alias for this column. When present, the value is passed to your Grader as `context.label`.
`metadata`	Per-row JSON object exposed to your AgentWorkflow and Grader as `ctx.metadata`. Use it to attach context the model or grader needs (such as tags, identifiers, or expected tool calls) without baking it into the prompt.

Include `ground_truth` (or `label`) when your Grader needs a reference answer to score against. Datasets that drive reward functions based purely on model behavior can omit it. Rows with only `metadata` (no `ground_truth`) still run through the Grader.
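The format and column rules above can be approximated with a short pre-flight check before you run the CLI. This is a minimal sketch for JSONL input only, not the platform's own validation logic; `check_jsonl_rows` is a hypothetical helper name.

```python
import json

REQUIRED_COLUMNS = ("system_prompt", "user_prompt")
MIN_ROWS = 4  # minimum row count stated under Dataset Format


def check_jsonl_rows(lines):
    """Return a list of problems found in an iterable of JSONL lines."""
    problems = []
    row_count = 0
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        row_count += 1
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {line_no}: row is not a JSON object")
            continue
        for column in REQUIRED_COLUMNS:
            if column not in row:
                problems.append(
                    f"line {line_no}: missing required column {column!r}"
                )
    if row_count < MIN_ROWS:
        problems.append(
            f"only {row_count} rows; at least {MIN_ROWS} are required"
        )
    return problems
```

To use it on a file, pass the open file handle, for example `check_jsonl_rows(open("data/train.jsonl", encoding="utf-8"))`. An empty list means none of these particular checks failed.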

Metadata Validation Rules

`osmosis dataset upload` and `osmosis dataset validate` enforce the following rules on the `metadata` column for CSV, JSONL, and Parquet:

Each cell must be a JSON object (a dictionary). The CLI parses CSV cells and JSONL strings as JSON, and Parquet accepts a struct column, a null column, or a JSON-object string column.
Nested empty objects (`{}` inside the top-level object) fail validation. A top-level `{}` is fine for individual rows, but the sampled rows cannot all be empty objects.
Value types for each key must stay consistent across rows. For example, metadata.tag cannot be a string in one row and a number in another.
Integer values must fit in a signed 64-bit range.
The CLI treats empty strings and missing values as absent and skips them during validation.
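The rules above can be sketched in Python to show how they interact. This is an illustration, not the CLI's implementation. It rests on several assumptions: it checks every cell it is given rather than a sample, it compares value types only for top-level keys, it ignores `null` values when comparing types, and it also treats an empty object inside a list as nested. `validate_metadata` and its helpers are hypothetical names.

```python
import json

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def _type_name(value):
    # bool is tested before int because bool is a subclass of int in Python.
    for name, kind in (("bool", bool), ("int", int), ("float", float),
                       ("str", str), ("list", list), ("object", dict)):
        if isinstance(value, kind):
            return name
    return "null"


def _walk(value, path, row_no, errors, top=False):
    """Flag nested empty objects and integers outside the int64 range."""
    if isinstance(value, dict):
        if not value and not top:
            errors.append(f"row {row_no}: {path} is a nested empty object")
        for key, child in value.items():
            _walk(child, f"{path}.{key}", row_no, errors)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            _walk(child, f"{path}[{index}]", row_no, errors)
    elif isinstance(value, int) and not isinstance(value, bool):
        if not INT64_MIN <= value <= INT64_MAX:
            errors.append(f"row {row_no}: {path} does not fit in 64 bits")


def validate_metadata(cells):
    """Check a sequence of metadata cells (dicts, JSON strings, or absent)."""
    errors = []
    seen_types = {}  # top-level key -> first type observed
    present = non_empty = 0
    for row_no, cell in enumerate(cells, start=1):
        if cell is None or cell == "":
            continue  # absent values are skipped
        if isinstance(cell, str):
            try:
                cell = json.loads(cell)
            except json.JSONDecodeError:
                errors.append(f"row {row_no}: metadata is not valid JSON")
                continue
        if not isinstance(cell, dict):
            errors.append(f"row {row_no}: metadata must be a JSON object")
            continue
        present += 1
        non_empty += bool(cell)
        _walk(cell, "metadata", row_no, errors, top=True)
        for key, value in cell.items():
            if value is None:
                continue  # assumption: nulls do not fix a key's type
            name = _type_name(value)
            first = seen_types.setdefault(key, name)
            if first != name:
                errors.append(
                    f"row {row_no}: metadata.{key} is {name}, expected {first}"
                )
    if present and non_empty == 0:
        errors.append("every metadata cell is an empty object")
    return errors
```

For example, `validate_metadata([{"tag": "a"}, {"tag": 3}])` reports one type mismatch for `metadata.tag`, while a mix of populated objects, empty strings, and `None` passes cleanly.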

Example JSONL

{"system_prompt": "You are a helpful math tutor.", "user_prompt": "What is 15 * 23?", "ground_truth": "345"}
{"system_prompt": "You are a helpful math tutor.", "user_prompt": "Simplify 3/9.", "ground_truth": "1/3"}
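A file like the example above can also be generated programmatically. The sketch below serializes rows with Python's standard `json` module, one object per line. The last two rows and the `topic` metadata key are invented for illustration, and `to_jsonl` is a hypothetical helper, not part of the Osmosis CLI.

```python
import json

SYSTEM_PROMPT = "You are a helpful math tutor."

# Four rows, matching the minimum dataset size. The "metadata" values
# are illustrative only.
rows = [
    {"system_prompt": SYSTEM_PROMPT, "user_prompt": "What is 15 * 23?",
     "ground_truth": "345", "metadata": {"topic": "arithmetic"}},
    {"system_prompt": SYSTEM_PROMPT, "user_prompt": "Simplify 3/9.",
     "ground_truth": "1/3", "metadata": {"topic": "fractions"}},
    {"system_prompt": SYSTEM_PROMPT, "user_prompt": "What is 7 + 8?",
     "ground_truth": "15", "metadata": {"topic": "arithmetic"}},
    {"system_prompt": SYSTEM_PROMPT, "user_prompt": "What is 12 squared?",
     "ground_truth": "144", "metadata": {"topic": "arithmetic"}},
]


def to_jsonl(rows):
    """Serialize each row as one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)
```

Writing `to_jsonl(rows)` to a path such as `data/train.jsonl` gives a file you can pass to the validate and upload commands described below.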

Upload a Dataset

osmosis dataset upload data/train.jsonl

The uploaded dataset is named after the file stem (`train` in this example). After upload, the dataset enters a processing pipeline. Check its status with:

osmosis dataset info <dataset-name>

Status	Description
uploading	File upload has started and is not complete yet.
pending	Upload received, waiting to be processed.
processing	Dataset is being validated and indexed.
uploaded	Dataset is ready for use in evaluation runs and training runs.
error	Processing failed. Check column names and file format.
cancelled	Upload was cancelled before processing completed.
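If you script around these statuses, it helps to separate the states that can still change from the ones that are final. The sketch below groups the values from the table. The grouping and the `classify_status` helper are assumptions made for illustration, and the code does not call the CLI or the platform.

```python
# Status values taken from the table above.
IN_PROGRESS = {"uploading", "pending", "processing"}
READY = {"uploaded"}
STOPPED = {"error", "cancelled"}


def classify_status(status):
    """Map a dataset status to 'in_progress', 'ready', or 'stopped'."""
    if status in IN_PROGRESS:
        return "in_progress"
    if status in READY:
        return "ready"
    if status in STOPPED:
        return "stopped"
    raise ValueError(f"unknown dataset status: {status!r}")
```

A polling loop would keep waiting while the result is `in_progress` and stop on either of the other two.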

Validate Locally

Before uploading, validate your dataset locally to catch format issues early:

osmosis dataset validate data/train.jsonl

This checks required columns, file format, basic JSONL/CSV/Parquet structure, and the `metadata` rules listed under Metadata Validation Rules, all without uploading anything to the platform.

Preview a Dataset

Preview the first few rows of an uploaded dataset:

osmosis dataset preview my-dataset --rows 5

Manage Datasets

# List all datasets in the current workspace
osmosis dataset list

# Download a dataset file
osmosis dataset download my-dataset

Next Steps

Training Runs

Models