Installation
Global Usage
Authentication Commands
login
Authenticate with the Osmosis AI platform.

Usage
logout
Log out and revoke the CLI token.

Usage
whoami
Display the currently authenticated user and workspace information.

Usage
Output
workspace
Manage workspaces.

Usage
Subcommands
| Subcommand | Description |
|---|---|
| list | List all accessible workspaces |
| current | Show current active workspace |
| switch <workspace_id> | Switch to a different workspace |
Examples
Evaluation Commands
preview
Inspect and validate rubric configurations or dataset files.

Usage
Options
| Option | Type | Required | Description |
|---|---|---|---|
| --path | string | Yes | Path to the file to preview (YAML or JSONL) |
Examples
Preview a rubric configuration:

Output

The command will:

- Validate the file structure
- Display parsed contents in a readable format
- Show count summary (number of rubrics or records)
- Report any validation errors
eval
Evaluate a dataset against a rubric configuration.

Usage
Required Options
| Option | Short | Type | Description |
|---|---|---|---|
| --rubric | -r | string | Rubric ID from your configuration file |
| --data | -d | string | Path to JSONL dataset file |
Optional Parameters
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --config | -c | string | Auto-discovered | Path to rubric configuration YAML |
| --number | -n | integer | 1 | Number of evaluation runs per record |
| --output | -o | string | ~/.cache/osmosis/... | Output path for results JSON |
| --baseline | -b | string | None | Path to baseline evaluation for comparison |
Examples
Basic evaluation:

Configuration Files
Rubric Configuration (YAML)
The rubric configuration file defines evaluation criteria and model settings.

Structure
Required Fields
- version: Configuration schema version (currently 1)
- rubrics: List of rubric definitions
Rubric Definition Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier for the rubric |
| title | string | Yes | Human-readable title |
| rubric | string | Yes | Evaluation criteria in natural language |
| model_info | object | Yes | LLM provider configuration |
| score_min | float | No | Minimum score (overrides default) |
| score_max | float | No | Maximum score (overrides default) |
Model Info Fields
| Field | Type | Required | Description |
|---|---|---|---|
| provider | string | Yes | Provider name (see Supported Providers) |
| model | string | Yes | Model identifier |
| api_key_env | string | No | Environment variable name for API key |
| timeout | integer | No | Request timeout in seconds (default: 30) |
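Putting these fields together, a minimal rubric_configs.yaml might look like the following sketch. The rubric ID, title, criteria text, and score range are illustrative placeholders; the provider and model values come from the Supported Providers table below.

```yaml
version: 1
rubrics:
  - id: helpfulness-v1            # illustrative ID; must be unique
    title: Response Helpfulness
    rubric: >
      Score the response from 0 to 10 based on how directly and
      accurately it answers the user's question.
    score_min: 0.0
    score_max: 10.0
    model_info:
      provider: openai
      model: gpt-5
      api_key_env: OPENAI_API_KEY
      timeout: 30
```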
Auto-Discovery
If you don’t specify --config, the CLI searches for rubric_configs.yaml in:

- The same directory as the data file
- The current working directory
- An ./examples/ subdirectory
Dataset Format (JSONL)
Each line in the JSONL file represents one evaluation record.

Minimal Example
Complete Example
Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
| solution_str | string | Yes | The text to be evaluated (must be non-empty) |
| conversation_id | string | No | Unique identifier for this record |
| rubric_id | string | No | Links to a specific rubric in config |
| original_input | string | No | Original user query/prompt for context |
| ground_truth | string | No | Reference answer for comparison |
| metadata | object | No | Additional context passed to evaluator |
| extra_info | object | No | Runtime configuration options |
| score_min | float | No | Override minimum score for this record |
| score_max | float | No | Override maximum score for this record |
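As an illustration of the field reference above, a record combining the required field with several optional ones might look like this (one JSON object per line; all values are hypothetical):

```json
{"solution_str": "The capital of France is Paris.", "conversation_id": "conv-001", "rubric_id": "helpfulness-v1", "original_input": "What is the capital of France?", "ground_truth": "Paris", "metadata": {"source": "demo"}, "score_min": 0.0, "score_max": 10.0}
```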
Output Format
Console Output
During evaluation, you’ll see:

JSON Output File

The output JSON file contains detailed results:

Supported Providers
| Provider | Value | API Key Env | Example Models |
|---|---|---|---|
| OpenAI | openai | OPENAI_API_KEY | gpt-5 |
| Anthropic | anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-5 |
| Google Gemini | gemini | GOOGLE_API_KEY | gemini-2.5-flash |
| xAI | xai | XAI_API_KEY | grok-4 |
| OpenRouter | openrouter | OPENROUTER_API_KEY | 100+ models |
| Cerebras | cerebras | CEREBRAS_API_KEY | llama3.1-405b |
Provider Configuration Example
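For example, a model_info block targeting Anthropic might look like this sketch (provider, model, and API key variable taken from the providers table above; the timeout value is illustrative):

```yaml
model_info:
  provider: anthropic
  model: claude-sonnet-4-5
  api_key_env: ANTHROPIC_API_KEY
  timeout: 60
```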
Advanced Usage
Baseline Comparison
Compare new evaluations against a baseline to detect regressions:

Variance Analysis

Run multiple evaluations per record to measure score consistency. This is useful for:

- Understanding rubric stability
- Detecting ambiguous criteria
- A/B testing different prompts
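As a sketch of how repeated-run scores can be summarized, assuming each record's runs yield a list of numeric scores (the exact layout of the results JSON may differ), per-record consistency reduces to a mean and standard deviation:

```python
import statistics

def score_stats(scores):
    """Summarize repeated evaluation scores for one record."""
    mean = statistics.fmean(scores)
    # Population stdev; 0.0 means the rubric scored identically on every run.
    stdev = statistics.pstdev(scores)
    return {"mean": mean, "stdev": stdev}

# Hypothetical scores from three runs (--number 3) of one record.
print(score_stats([8.0, 9.0, 8.5]))
```

A high standard deviation on many records is a hint that the rubric criteria are ambiguous and worth tightening.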
Batch Processing
Process multiple datasets:

Custom Cache Location

Override the default cache directory:

Error Handling
Common Errors
API Key Not Found
Rubric Not Found
Check rubric_configs.yaml and ensure the rubric ID matches exactly.
Invalid JSONL Format
Model Not Found
Timeout Error
Best Practices
Writing Effective Rubrics:

- Be specific and measurable
- Include clear criteria and examples
- Test with sample data before large-scale evaluation
- Include diverse examples with relevant metadata
- Validate JSONL syntax before evaluation
- Keep solution_str concise but complete
- Process datasets in batches for cost efficiency
- Start with small samples to test rubrics
- Monitor API usage through provider dashboards
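To act on the "validate JSONL syntax" tip above, a small pre-flight check like this sketch can catch malformed lines and empty solution_str fields before a costly evaluation run (field names taken from the dataset reference above):

```python
import json

def validate_jsonl(lines):
    """Return (line_number, error) tuples for records that would fail."""
    errors = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((i, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict) or not record.get("solution_str"):
            errors.append((i, "missing or empty 'solution_str'"))
    return errors

# Example: line 2 is broken JSON, line 3 lacks solution_str.
sample = [
    '{"solution_str": "Paris is the capital of France."}',
    '{"solution_str": "truncated...',
    '{"conversation_id": "c-2"}',
]
for lineno, err in validate_jsonl(sample):
    print(lineno, err)
```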
Troubleshooting
Debug Mode:

Remote Rollout Commands

The CLI also provides commands for running and testing remote rollout servers. See the Remote Rollout documentation for the complete guide.

serve

Start a RolloutServer for an agent loop implementation.

Usage
Required Options
| Option | Short | Type | Description |
|---|---|---|---|
| --module | -m | string | Module path to the agent loop (e.g., my_agent:agent_loop) |
Optional Parameters
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --port | -p | integer | 9000 | Port to bind to |
| --host | -H | string | 0.0.0.0 | Host to bind to |
| --no-validate | | flag | false | Skip agent loop validation |
| --reload | | flag | false | Enable auto-reload for development |
| --log-level | | string | info | Uvicorn log level (debug/info/warning/error/critical) |
| --log | | string | | Enable logging to specified directory |
| --api-key | | string | auto-generated | API key for TrainGate authentication |
| --local | | flag | false | Local debug mode (disables auth and registration) |
| --skip-register | | flag | false | Skip Osmosis Platform registration |
Examples
validate
Validate a RolloutAgentLoop implementation without starting the server.

Usage
Options
| Option | Short | Type | Description |
|---|---|---|---|
| --module | -m | string | Module path to the agent loop |
| --verbose | -v | flag | Show detailed validation output |
Examples
test
Test a RolloutAgentLoop against a dataset using cloud LLM providers.

Usage
Required Options
| Option | Short | Type | Description |
|---|---|---|---|
| --module | -m | string | Module path to the agent loop |
| --dataset | -d | string | Path to dataset file (.json, .jsonl, .parquet) |
Optional Parameters
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --model | | string | gpt-4o | Model name (e.g., anthropic/claude-sonnet-4-20250514) |
| --max-turns | | integer | 10 | Max agent turns per row |
| --temperature | | float | | LLM sampling temperature |
| --max-tokens | | integer | | Maximum tokens per completion |
| --limit | | integer | all | Max rows to test |
| --offset | | integer | 0 | Rows to skip |
| --output | -o | string | | Output JSON file for results |
| --quiet | -q | flag | false | Suppress progress output |
| --debug | | flag | false | Enable debug output |
| --interactive | -i | flag | false | Enable interactive mode |
| --row | | integer | | Initial row for interactive mode |
| --api-key | | string | | API key for LLM provider |
| --base-url | | string | | Base URL for OpenAI-compatible APIs |
Examples
Interactive Mode Commands
| Command | Description |
|---|---|
| n | Execute next LLM call |
| c | Continue to completion |
| m | Show message history |
| t | Show available tools |
| q | Quit session |