The osmosis-ai CLI provides commands for rubric evaluation and remote rollout server management.

Installation

pip install osmosis-ai
The CLI is accessible via three aliases:
osmosis
osmosis-ai
osmosis_ai

Global Usage

osmosis [command] [options]

Authentication Commands

login

Authenticate with the Osmosis AI platform.

Usage

osmosis login
This command opens a browser for OAuth authentication. After a successful login, your CLI session is authenticated and credentials are stored locally.

logout

Log out and revoke the CLI token.

Usage

osmosis logout

whoami

Display current authenticated user and workspace information.

Usage

osmosis whoami

Output

Logged in as: user@example.com
Workspace: My Workspace (ws_abc123)

workspace

Manage workspaces.

Usage

osmosis workspace [subcommand]

Subcommands

| Subcommand | Description |
| --- | --- |
| list | List all accessible workspaces |
| current | Show current active workspace |
| switch <workspace_id> | Switch to a different workspace |

Examples

# List all workspaces
osmosis workspace list

# Show current workspace
osmosis workspace current

# Switch workspace
osmosis workspace switch ws_abc123

Evaluation Commands

preview

Inspect and validate rubric configurations or dataset files.

Usage

osmosis preview --path <file_path>

Options

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| --path | string | Yes | Path to the file to preview (YAML or JSONL) |

Examples

Preview a rubric configuration:
osmosis preview --path rubric_configs.yaml
Preview a dataset:
osmosis preview --path sample_data.jsonl

Output

The command will:
  • Validate the file structure
  • Display parsed contents in a readable format
  • Show count summary (number of rubrics or records)
  • Report any validation errors

eval

Evaluate a dataset against a rubric configuration.

Usage

osmosis eval --rubric <rubric_id> --data <data_path> [options]

Required Options

| Option | Short | Type | Description |
| --- | --- | --- | --- |
| --rubric | -r | string | Rubric ID from your configuration file |
| --data | -d | string | Path to JSONL dataset file |

Optional Parameters

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --config | -c | string | Auto-discovered | Path to rubric configuration YAML |
| --number | -n | integer | 1 | Number of evaluation runs per record |
| --output | -o | string | ~/.cache/osmosis/... | Output path for results JSON |
| --baseline | -b | string | None | Path to baseline evaluation for comparison |

Examples

Basic evaluation:
osmosis eval --rubric helpfulness --data responses.jsonl
Multiple runs for variance analysis:
osmosis eval --rubric helpfulness --data responses.jsonl --number 5
Custom output location:
osmosis eval --rubric helpfulness --data responses.jsonl --output ./results/eval_001.json
Compare against baseline:
osmosis eval --rubric helpfulness --data new_responses.jsonl --baseline ./results/baseline.json
Custom configuration file:
osmosis eval --rubric helpfulness --data responses.jsonl --config ./configs/custom_rubrics.yaml

Configuration Files

Rubric Configuration (YAML)

The rubric configuration file defines evaluation criteria and model settings.

Structure

version: 1
default_score_min: 0.0
default_score_max: 1.0

rubrics:
  - id: rubric_identifier
    title: Human-Readable Title
    rubric: |
      Your evaluation criteria here.
      Can be multiple lines.
    model_info:
      provider: openai
      model: gpt-5
      api_key_env: OPENAI_API_KEY
      timeout: 30
    score_min: 0.0  # Optional override
    score_max: 1.0  # Optional override

Required Fields

  • version: Configuration schema version (currently 1)
  • rubrics: List of rubric definitions

Rubric Definition Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique identifier for the rubric |
| title | string | Yes | Human-readable title |
| rubric | string | Yes | Evaluation criteria in natural language |
| model_info | object | Yes | LLM provider configuration |
| score_min | float | No | Minimum score (overrides default) |
| score_max | float | No | Maximum score (overrides default) |

Model Info Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| provider | string | Yes | Provider name (see Supported Providers) |
| model | string | Yes | Model identifier |
| api_key_env | string | No | Environment variable name for API key |
| timeout | integer | No | Request timeout in seconds (default: 30) |
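
To catch structural problems before running an evaluation, a short script can check the required fields listed above. The sketch below is illustrative only (it assumes PyYAML is installed); osmosis preview remains the authoritative validator.

# check_rubric_config.py -- illustrative sketch, not part of the CLI.
# Checks only the required fields documented above.
import sys
import yaml

REQUIRED_RUBRIC_FIELDS = {"id", "title", "rubric", "model_info"}
REQUIRED_MODEL_FIELDS = {"provider", "model"}

def check(path: str) -> None:
    with open(path) as f:
        config = yaml.safe_load(f)
    assert config.get("version") == 1, "unsupported schema version"
    for rubric in config["rubrics"]:
        missing = REQUIRED_RUBRIC_FIELDS - rubric.keys()
        assert not missing, f"rubric missing fields: {missing}"
        missing = REQUIRED_MODEL_FIELDS - rubric["model_info"].keys()
        assert not missing, f"model_info missing fields: {missing}"
    print(f"{len(config['rubrics'])} rubric(s) look structurally valid")

if __name__ == "__main__":
    check(sys.argv[1])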

Auto-Discovery

If you don’t specify --config, the CLI searches for rubric_configs.yaml in:
  1. Same directory as the data file
  2. Current working directory
  3. ./examples/ subdirectory
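
The same order can be reproduced in a few lines; this sketch mirrors the three documented locations and is not the CLI's actual implementation:

# Illustrative sketch of the documented search order.
from pathlib import Path

def find_config(data_path: str) -> Path | None:
    candidates = [
        Path(data_path).parent / "rubric_configs.yaml",   # 1. next to the data file
        Path.cwd() / "rubric_configs.yaml",               # 2. current working directory
        Path.cwd() / "examples" / "rubric_configs.yaml",  # 3. ./examples/ subdirectory
    ]
    return next((p for p in candidates if p.is_file()), None)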

Dataset Format (JSONL)

Each line in the JSONL file represents one evaluation record.

Minimal Example

{"solution_str": "The AI's response text to evaluate"}

Complete Example

{
  "conversation_id": "ticket-12345",
  "rubric_id": "helpfulness",
  "original_input": "How do I reset my password?",
  "solution_str": "Click 'Forgot Password' on the login page and follow the email instructions.",
  "ground_truth": "Users should use the password reset link sent to their registered email.",
  "metadata": {
    "customer_tier": "premium",
    "category": "account_management"
  },
  "score_min": 0.0,
  "score_max": 10.0
}

Field Reference

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| solution_str | string | Yes | The text to be evaluated (must be non-empty) |
| conversation_id | string | No | Unique identifier for this record |
| rubric_id | string | No | Links to a specific rubric in config |
| original_input | string | No | Original user query/prompt for context |
| ground_truth | string | No | Reference answer for comparison |
| metadata | object | No | Additional context passed to evaluator |
| extra_info | object | No | Runtime configuration options |
| score_min | float | No | Override minimum score for this record |
| score_max | float | No | Override maximum score for this record |
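
Datasets are often generated programmatically. The sketch below writes records using only the fields documented above; the record contents are placeholders:

# build_dataset.py -- illustrative sketch for producing a JSONL dataset.
import json

records = [
    {
        "conversation_id": "ticket-12345",
        "original_input": "How do I reset my password?",
        "solution_str": "Click 'Forgot Password' on the login page.",  # required, non-empty
    },
]

with open("responses.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line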

Output Format

Console Output

During evaluation, you’ll see:
Evaluating records: 100%|████████████████| 50/50 [00:45<00:00, 1.1record/s]

Overall Statistics:
  Average Score: 0.847
  Min Score: 0.200
  Max Score: 1.000
  Variance: 0.034
  Std Deviation: 0.185
  Success Rate: 100.0% (50/50)

Evaluation Results:
  ...

Results saved to:
  ~/.cache/osmosis/eval_result/helpfulness/rubric_eval_result_20250114_143022.json

JSON Output File

The output JSON file contains detailed results:
{
  "rubric_id": "helpfulness",
  "timestamp": "2025-01-14T14:30:22.123456",
  "duration_seconds": 45.2,
  "total_records": 50,
  "successful_evaluations": 50,
  "failed_evaluations": 0,
  "statistics": {
    "average": 0.847,
    "min": 0.200,
    "max": 1.000,
    "variance": 0.034,
    "std_dev": 0.185
  },
  "results": [
    {
      "conversation_id": "ticket-12345",
      "runs": [
        {
          "score": 0.85,
          "explanation": "The response directly addresses the question...",
          "raw": {...},
          "duration_ms": 890
        }
      ],
      "aggregate_stats": {
        "average": 0.85,
        "variance": 0.0
      }
    }
  ],
  "model_info": {
    "provider": "openai",
    "model": "gpt-5"
  }
}
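
Because results are plain JSON, they are straightforward to post-process. An illustrative sketch that prints the documented top-level statistics (the file path is an example):

# Illustrative sketch: summarize a results file using only the keys shown above.
import json

with open("results/eval_001.json") as f:
    result = json.load(f)

stats = result["statistics"]
print(f"{result['rubric_id']}: avg={stats['average']:.3f} "
      f"(min={stats['min']:.3f}, max={stats['max']:.3f}, "
      f"n={result['total_records']})")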

Supported Providers

| Provider | Value | API Key Env | Example Models |
| --- | --- | --- | --- |
| OpenAI | openai | OPENAI_API_KEY | gpt-5 |
| Anthropic | anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-5 |
| Google Gemini | gemini | GOOGLE_API_KEY | gemini-2.5-flash |
| xAI | xai | XAI_API_KEY | grok-4 |
| OpenRouter | openrouter | OPENROUTER_API_KEY | 100+ models |
| Cerebras | cerebras | CEREBRAS_API_KEY | llama3.1-405b |

Provider Configuration Example

model_info:
  provider: openai             # Or: anthropic, gemini, xai, openrouter, cerebras
  model: gpt-5                 # Provider-specific model identifier
  api_key_env: OPENAI_API_KEY  # Environment variable name
  timeout: 30                  # Optional timeout in seconds
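
A plausible reading of key resolution, shown as a sketch: the variable named by api_key_env is read from the environment, falling back to the provider default from the table above. The fallback behavior is an assumption, not documented:

# Assumption: api_key_env names the variable to read, with the provider
# defaults from the table above as a fallback. Illustrative only.
import os

DEFAULT_KEY_ENVS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "xai": "XAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "cerebras": "CEREBRAS_API_KEY",
}

def resolve_api_key(model_info: dict) -> str:
    env_var = model_info.get("api_key_env") or DEFAULT_KEY_ENVS[model_info["provider"]]
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"API key not found: set {env_var}")
    return key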

Advanced Usage

Baseline Comparison

Compare new evaluations against a baseline to detect regressions:
# Create baseline
osmosis eval --rubric helpfulness --data baseline.jsonl --output baseline.json

# Compare new data against baseline
osmosis eval --rubric helpfulness --data new_data.jsonl --baseline baseline.json
The output will include delta statistics showing improvements or regressions.
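
For custom reporting, the two result files can also be diffed directly. An illustrative sketch using the documented statistics block (paths are examples):

# Illustrative sketch: compute the average-score delta between two result files.
import json

def load_stats(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["statistics"]

baseline = load_stats("baseline.json")
current = load_stats("results/eval_001.json")

delta = current["average"] - baseline["average"]
print(f"average: {baseline['average']:.3f} -> {current['average']:.3f} ({delta:+.3f})")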

Variance Analysis

Run multiple evaluations per record to measure score consistency:
osmosis eval --rubric helpfulness --data responses.jsonl --number 10
Useful for:
  • Understanding rubric stability
  • Detecting ambiguous criteria
  • A/B testing different prompts
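
Per-run scores are recorded in the results list of the output file, so run-to-run spread can be inspected directly. An illustrative sketch with an arbitrary threshold:

# Illustrative sketch: flag records whose runs disagree the most.
import json
import statistics

with open("results/eval_001.json") as f:  # example path
    result = json.load(f)

for record in result["results"]:
    scores = [run["score"] for run in record["runs"]]
    if len(scores) > 1:
        spread = statistics.pstdev(scores)
        if spread > 0.2:  # illustrative threshold
            print(f"{record['conversation_id']}: std dev {spread:.3f} across {len(scores)} runs")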

Batch Processing

Process multiple datasets:
for file in data/*.jsonl; do
  osmosis eval --rubric helpfulness --data "$file" --output "results/$(basename "$file" .jsonl).json"
done

Custom Cache Location

Override the default cache directory:
export OSMOSIS_CACHE_DIR=/path/to/custom/cache
osmosis eval --rubric helpfulness --data responses.jsonl
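
Resolution presumably follows the usual environment-variable-with-default pattern; a sketch under that assumption:

# Assumption: OSMOSIS_CACHE_DIR overrides the documented ~/.cache/osmosis default.
import os
from pathlib import Path

cache_dir = Path(os.environ.get("OSMOSIS_CACHE_DIR", Path.home() / ".cache" / "osmosis"))
print(f"Results will land under: {cache_dir / 'eval_result'}")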

Error Handling

Common Errors

API Key Not Found

Error: API key not found for provider 'openai'
Solution: Set the environment variable:
export OPENAI_API_KEY="your-key-here"

Rubric Not Found

Error: Rubric 'helpfulness' not found in configuration
Solution: Check your rubric_configs.yaml and ensure the rubric ID matches exactly.

Invalid JSONL Format

Error: Invalid JSON on line 5
Solution: Validate your JSONL file. Each line must be valid JSON.
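
A quick way to locate the offending line with the standard library (illustrative; any JSON linter works just as well):

# Illustrative sketch: report the first malformed line in a JSONL file.
import json
import sys

with open(sys.argv[1]) as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Invalid JSON on line {lineno}: {e}")
            break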

Model Not Found

Error: Model 'gpt-5' not available for provider 'openai'
Solution: Use a valid model identifier for your chosen provider.

Timeout Error

Error: Request timed out after 30 seconds
Solution: Increase the timeout in your model configuration:
model_info:
  timeout: 60

Best Practices

Writing Effective Rubrics:
  • Be specific and measurable
  • Include clear criteria and examples
  • Test with sample data before large-scale evaluation
Dataset Preparation:
  • Include diverse examples with relevant metadata
  • Validate JSONL syntax before evaluation
  • Keep solution_str concise but complete
Performance Optimization:
  • Process datasets in batches for cost efficiency
Cost Management:
  • Start with small samples to test rubrics
  • Monitor API usage through provider dashboards

Troubleshooting

Debug Mode:
export OSMOSIS_DEBUG=1
osmosis eval --rubric helpfulness --data responses.jsonl
Verify Installation:
pip show osmosis-ai
Test Setup:
osmosis preview --path rubric_configs.yaml
Check Results:
ls -lh ~/.cache/osmosis/eval_result/


Remote Rollout Commands

The CLI also provides commands for running and testing remote rollout servers. See the Remote Rollout documentation for the complete guide.

serve

Start a RolloutServer for an agent loop implementation.

Usage

osmosis serve -m <module:attribute> [options]

Required Options

| Option | Short | Type | Description |
| --- | --- | --- | --- |
| --module | -m | string | Module path to the agent loop (e.g., my_agent:agent_loop) |

Optional Parameters

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --port | -p | integer | 9000 | Port to bind to |
| --host | -H | string | 0.0.0.0 | Host to bind to |
| --no-validate | | flag | false | Skip agent loop validation |
| --reload | | flag | false | Enable auto-reload for development |
| --log-level | | string | info | Uvicorn log level (debug/info/warning/error/critical) |
| --log | | string | | Enable logging to the specified directory |
| --api-key | | string | auto-generated | API key for TrainGate authentication |
| --local | | flag | false | Local debug mode (disables auth and registration) |
| --skip-register | | flag | false | Skip Osmosis Platform registration |

Examples

# Start server with validation (default port 9000)
osmosis serve -m server:agent_loop

# Specify port
osmosis serve -m server:agent_loop -p 8080

# Enable debug logging
osmosis serve -m server:agent_loop --log ./rollout_logs

# Enable auto-reload for development
osmosis serve -m server:agent_loop --reload

# Local debug mode (no auth)
osmosis serve -m server:agent_loop --local

validate

Validate a RolloutAgentLoop implementation without starting the server.

Usage

osmosis validate -m <module:attribute> [options]

Options

| Option | Short | Type | Description |
| --- | --- | --- | --- |
| --module | -m | string | Module path to the agent loop |
| --verbose | -v | flag | Show detailed validation output |

Examples

# Validate agent loop
osmosis validate -m server:agent_loop

# Verbose output
osmosis validate -m server:agent_loop -v

test

Test a RolloutAgentLoop against a dataset using cloud LLM providers.

Usage

osmosis test -m <module:attribute> -d <dataset> [options]

Required Options

| Option | Short | Type | Description |
| --- | --- | --- | --- |
| --module | -m | string | Module path to the agent loop |
| --dataset | -d | string | Path to dataset file (.json, .jsonl, .parquet) |

Optional Parameters

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --model | | string | gpt-4o | Model name (e.g., anthropic/claude-sonnet-4-20250514) |
| --max-turns | | integer | 10 | Max agent turns per row |
| --temperature | | float | | LLM sampling temperature |
| --max-tokens | | integer | | Maximum tokens per completion |
| --limit | | integer | all | Max rows to test |
| --offset | | integer | 0 | Rows to skip |
| --output | -o | string | | Output JSON file for results |
| --quiet | -q | flag | false | Suppress progress output |
| --debug | | flag | false | Enable debug output |
| --interactive | -i | flag | false | Enable interactive mode |
| --row | | integer | | Initial row for interactive mode |
| --api-key | | string | | API key for LLM provider |
| --base-url | | string | | Base URL for OpenAI-compatible APIs |

Examples

# Batch test with GPT-4o (default)
osmosis test -m server:agent_loop -d multiply.parquet

# Use Claude
osmosis test -m server:agent_loop -d data.jsonl --model anthropic/claude-sonnet-4-20250514

# Test subset of data
osmosis test -m server:agent_loop -d data.jsonl --limit 10

# Save results
osmosis test -m server:agent_loop -d data.jsonl -o results.json

# Interactive debugging
osmosis test -m server:agent_loop -d data.jsonl --interactive

Interactive Mode Commands

| Command | Description |
| --- | --- |
| n | Execute next LLM call |
| c | Continue to completion |
| m | Show message history |
| t | Show available tools |
| q | Quit session |
