CLI Reference

The osmosis-ai CLI provides two main commands: preview for inspecting configurations and eval for running evaluations.

Installation

pip install osmosis-ai
The CLI is accessible via three aliases:
osmosis
osmosis-ai
osmosis_ai

Global Usage

osmosis [command] [options]

Commands

preview

Inspect and validate rubric configurations or dataset files.

Usage

osmosis preview --path <file_path>

Options

Option | Type | Required | Description
--path | string | Yes | Path to the file to preview (YAML or JSONL)

Examples

Preview a rubric configuration:
osmosis preview --path rubric_configs.yaml
Preview a dataset:
osmosis preview --path sample_data.jsonl

Output

The command will:
  • Validate the file structure
  • Display parsed contents in a readable format
  • Show count summary (number of rubrics or records)
  • Report any validation errors

eval

Evaluate a dataset against a rubric configuration.

Usage

osmosis eval --rubric <rubric_id> --data <data_path> [options]

Required Options

Option | Short | Type | Description
--rubric | -r | string | Rubric ID from your configuration file
--data | -d | string | Path to JSONL dataset file

Optional Parameters

Option | Short | Type | Default | Description
--config | -c | string | Auto-discovered | Path to rubric configuration YAML
--number | -n | integer | 1 | Number of evaluation runs per record
--output | -o | string | ~/.cache/osmosis/... | Output path for results JSON
--baseline | -b | string | None | Path to baseline evaluation for comparison

Examples

Basic evaluation:
osmosis eval --rubric helpfulness --data responses.jsonl
Multiple runs for variance analysis:
osmosis eval --rubric helpfulness --data responses.jsonl --number 5
Custom output location:
osmosis eval --rubric helpfulness --data responses.jsonl --output ./results/eval_001.json
Compare against baseline:
osmosis eval --rubric helpfulness --data new_responses.jsonl --baseline ./results/baseline.json
Custom configuration file:
osmosis eval --rubric helpfulness --data responses.jsonl --config ./configs/custom_rubrics.yaml
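If you need to drive the CLI from a script (for example in CI), the documented flags can be passed through a subprocess call. This is an illustrative sketch rather than a dedicated Python API; the rubric ID and file paths are placeholders.

import subprocess

# Invoke `osmosis eval` with the documented flags; rubric ID and paths are placeholders.
result = subprocess.run(
    [
        "osmosis", "eval",
        "--rubric", "helpfulness",
        "--data", "responses.jsonl",
        "--output", "results/eval_001.json",
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    raise RuntimeError(f"osmosis eval failed:\n{result.stderr}")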

Configuration Files

Rubric Configuration (YAML)

The rubric configuration file defines evaluation criteria and model settings.

Structure

version: 1
default_score_min: 0.0
default_score_max: 1.0

rubrics:
  - id: rubric_identifier
    title: Human-Readable Title
    rubric: |
      Your evaluation criteria here.
      Can be multiple lines.
    model_info:
      provider: openai
      model: gpt-5
      api_key_env: OPENAI_API_KEY
      timeout: 30
    score_min: 0.0  # Optional override
    score_max: 1.0  # Optional override

Required Fields

  • version: Configuration schema version (currently 1)
  • rubrics: List of rubric definitions

Rubric Definition Fields

Field | Type | Required | Description
id | string | Yes | Unique identifier for the rubric
title | string | Yes | Human-readable title
rubric | string | Yes | Evaluation criteria in natural language
model_info | object | Yes | LLM provider configuration
score_min | float | No | Minimum score (overrides default)
score_max | float | No | Maximum score (overrides default)

Model Info Fields

Field | Type | Required | Description
provider | string | Yes | Provider name (see Supported Providers)
model | string | Yes | Model identifier
api_key_env | string | No | Environment variable name for API key
timeout | integer | No | Request timeout in seconds (default: 30)
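For a quick structural check outside the CLI (for example in a test suite), the required fields above can be verified with PyYAML. This is a sketch and assumes PyYAML is installed; it is not part of osmosis-ai, and osmosis preview remains the authoritative validator.

import yaml  # PyYAML; not bundled with osmosis-ai

REQUIRED_RUBRIC_FIELDS = {"id", "title", "rubric", "model_info"}
REQUIRED_MODEL_FIELDS = {"provider", "model"}

with open("rubric_configs.yaml") as f:
    config = yaml.safe_load(f)

assert config.get("version") == 1, "unsupported config version"
for rubric in config.get("rubrics", []):
    missing = REQUIRED_RUBRIC_FIELDS - rubric.keys()
    assert not missing, f"rubric {rubric.get('id')!r} missing fields: {missing}"
    missing_model = REQUIRED_MODEL_FIELDS - rubric["model_info"].keys()
    assert not missing_model, f"rubric {rubric['id']!r} missing model_info fields: {missing_model}"
print(f"Checked {len(config.get('rubrics', []))} rubric(s)")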

Auto-Discovery

If you don’t specify --config, the CLI searches for rubric_configs.yaml in:
  1. Same directory as the data file
  2. Current working directory
  3. ./examples/ subdirectory
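The same search order can be reproduced in a few lines if you want to resolve the config path yourself before calling the CLI. This sketch is illustrative only and simply mirrors the list above.

from pathlib import Path

def find_rubric_config(data_path: str) -> Path | None:
    """Mirror the documented auto-discovery order for rubric_configs.yaml (illustrative)."""
    candidates = [
        Path(data_path).resolve().parent / "rubric_configs.yaml",  # 1. same directory as the data file
        Path.cwd() / "rubric_configs.yaml",                        # 2. current working directory
        Path.cwd() / "examples" / "rubric_configs.yaml",           # 3. ./examples/ subdirectory
    ]
    return next((p for p in candidates if p.is_file()), None)

config_path = find_rubric_config("responses.jsonl")
print(config_path or "No rubric_configs.yaml found; pass --config explicitly.")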

Dataset Format (JSONL)

Each line in the JSONL file represents one evaluation record.

Minimal Example

{"solution_str": "The AI's response text to evaluate"}

Complete Example

{
  "conversation_id": "ticket-12345",
  "rubric_id": "helpfulness",
  "original_input": "How do I reset my password?",
  "solution_str": "Click 'Forgot Password' on the login page and follow the email instructions.",
  "ground_truth": "Users should use the password reset link sent to their registered email.",
  "metadata": {
    "customer_tier": "premium",
    "category": "account_management"
  },
  "score_min": 0.0,
  "score_max": 10.0
}

Field Reference

Field | Type | Required | Description
solution_str | string | Yes | The text to be evaluated (must be non-empty)
conversation_id | string | No | Unique identifier for this record
rubric_id | string | No | Links to a specific rubric in config
original_input | string | No | Original user query/prompt for context
ground_truth | string | No | Reference answer for comparison
metadata | object | No | Additional context passed to evaluator
extra_info | object | No | Runtime configuration options
score_min | float | No | Override minimum score for this record
score_max | float | No | Override maximum score for this record
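To build a dataset file programmatically, serialize one JSON object per line with the standard library. Field names below follow the reference above; everything except solution_str is optional, and the file name is a placeholder.

import json

# Example records using the documented field names; only solution_str is required.
records = [
    {
        "conversation_id": "ticket-12345",
        "rubric_id": "helpfulness",
        "original_input": "How do I reset my password?",
        "solution_str": "Click 'Forgot Password' on the login page and follow the email instructions.",
    },
    {"solution_str": "Another response to evaluate."},
]

with open("responses.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSON object per line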

Output Format

Console Output

During evaluation, you’ll see:
Evaluating records: 100%|████████████████| 50/50 [00:45<00:00, 1.1record/s]

Overall Statistics:
  Average Score: 0.847
  Min Score: 0.200
  Max Score: 1.000
  Variance: 0.034
  Std Deviation: 0.185
  Success Rate: 100.0% (50/50)

Evaluation Results:
  ...

Results saved to:
  ~/.cache/osmosis/eval_result/helpfulness/rubric_eval_result_20250114_143022.json

JSON Output File

The output JSON file contains detailed results:
{
  "rubric_id": "helpfulness",
  "timestamp": "2025-01-14T14:30:22.123456",
  "duration_seconds": 45.2,
  "total_records": 50,
  "successful_evaluations": 50,
  "failed_evaluations": 0,
  "statistics": {
    "average": 0.847,
    "min": 0.200,
    "max": 1.000,
    "variance": 0.034,
    "std_dev": 0.185
  },
  "results": [
    {
      "conversation_id": "ticket-12345",
      "runs": [
        {
          "score": 0.85,
          "explanation": "The response directly addresses the question...",
          "raw_payload": {...},
          "duration_ms": 890
        }
      ],
      "aggregate_stats": {
        "average": 0.85,
        "variance": 0.0
      }
    }
  ],
  "model_info": {
    "provider": "openai",
    "model": "gpt-5"
  }
}
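The results file can be post-processed with any JSON tooling. For example, a short script can print the overall statistics and flag low-scoring records; it assumes the schema shown above, and the file path and threshold are placeholders.

import json

# Load a results file produced by `osmosis eval` (path is a placeholder).
with open("results/eval_001.json") as f:
    report = json.load(f)

stats = report["statistics"]
print(f"{report['rubric_id']}: avg={stats['average']:.3f} "
      f"({report['successful_evaluations']}/{report['total_records']} succeeded)")

# Flag records whose average score across runs falls below an arbitrary threshold.
THRESHOLD = 0.5
for record in report["results"]:
    avg = record["aggregate_stats"]["average"]
    if avg < THRESHOLD:
        print(f"  low score {avg:.2f}: {record.get('conversation_id', '<no id>')}")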

Supported Providers

Provider | Value | API Key Env | Example Models
OpenAI | openai | OPENAI_API_KEY | gpt-5
Anthropic | anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-5
Google Gemini | gemini | GOOGLE_API_KEY | gemini-2.5-flash
xAI | xai | XAI_API_KEY | grok-4
OpenRouter | openrouter | OPENROUTER_API_KEY | 100+ models
Cerebras | cerebras | CEREBRAS_API_KEY | llama3.1-405b

Provider Configuration Example

model_info:
  provider: openai           # Or: anthropic, gemini, xai, openrouter, cerebras
  model: gpt-5        # Provider-specific model identifier
  api_key_env: OPENAI_API_KEY  # Environment variable name
  timeout: 30               # Optional timeout in seconds
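Before running an evaluation, it can help to confirm that the expected environment variable is actually set. The provider-to-variable mapping below is copied from the table above; the chosen provider is a placeholder.

import os

# Default API key environment variables per provider, as listed in the table above.
API_KEY_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "xai": "XAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "cerebras": "CEREBRAS_API_KEY",
}

provider = "openai"  # placeholder: match the provider in your rubric's model_info
env_var = API_KEY_ENV[provider]
if not os.environ.get(env_var):
    raise SystemExit(f"{env_var} is not set; export it before running `osmosis eval`.")
print(f"{env_var} is set for provider '{provider}'.")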

Advanced Usage

Baseline Comparison

Compare new evaluations against a baseline to detect regressions:
# Create baseline
osmosis eval --rubric helpfulness --data baseline.jsonl --output baseline.json

# Compare new data against baseline
osmosis eval --rubric helpfulness --data new_data.jsonl --baseline baseline.json
The output will include delta statistics showing improvements or regressions.
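If you want a quick delta outside the CLI, the two results files can also be compared directly using the statistics block documented above. The exact delta format printed by the CLI may differ from this sketch, and the new-run path is a placeholder.

import json

def load_average(path: str) -> float:
    """Read the overall average score from an eval results file."""
    with open(path) as f:
        return json.load(f)["statistics"]["average"]

baseline_avg = load_average("baseline.json")
new_avg = load_average("results/eval_001.json")  # placeholder: results file from the new run
delta = new_avg - baseline_avg
print(f"baseline={baseline_avg:.3f} new={new_avg:.3f} delta={delta:+.3f}")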

Variance Analysis

Run multiple evaluations per record to measure score consistency:
osmosis eval --rubric helpfulness --data responses.jsonl --number 10
Useful for:
  • Understanding rubric stability
  • Detecting ambiguous criteria
  • A/B testing different prompts
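The per-run scores in the results file make it easy to spot unstable records. The sketch below flags records whose scores spread widely across runs; it assumes the output schema documented earlier, and the file path and threshold are placeholders.

import json
import statistics

with open("results/eval_multi.json") as f:  # placeholder: results from a --number > 1 run
    report = json.load(f)

SPREAD_THRESHOLD = 0.2  # arbitrary: tune to your rubric's score range
for record in report["results"]:
    scores = [run["score"] for run in record["runs"]]
    if len(scores) > 1 and statistics.pstdev(scores) > SPREAD_THRESHOLD:
        print(f"unstable record {record.get('conversation_id', '<no id>')}: scores={scores}")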

Batch Processing

Process multiple datasets:
for file in data/*.jsonl; do
  osmosis eval --rubric helpfulness --data "$file" --output "results/$(basename $file .jsonl).json"
done

Custom Cache Location

Override the default cache directory:
export OSMOSIS_CACHE_DIR=/path/to/custom/cache
osmosis eval --rubric helpfulness --data responses.jsonl

Error Handling

Common Errors

API Key Not Found

Error: API key not found for provider 'openai'
Solution: Set the environment variable:
export OPENAI_API_KEY="your-key-here"

Rubric Not Found

Error: Rubric 'helpfulness' not found in configuration
Solution: Check your rubric_configs.yaml and ensure the rubric ID matches exactly.

Invalid JSONL Format

Error: Invalid JSON on line 5
Solution: Validate your JSONL file. Each line must be valid JSON.
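A short script can pinpoint the offending line before re-running the evaluation; it uses only the standard library, and the file name is a placeholder.

import json

# Report every line in the dataset that fails to parse as JSON.
with open("responses.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"line {line_number}: {exc}")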

Model Not Found

Error: Model 'gpt-5' not available for provider 'openai'
Solution: Use a valid model identifier for your chosen provider.

Timeout Error

Error: Request timed out after 30 seconds
Solution: Increase the timeout in your model configuration:
model_info:
  timeout: 60

Best Practices

Writing Effective Rubrics:
  • Be specific and measurable
  • Include clear criteria and examples
  • Test with sample data before large-scale evaluation
Dataset Preparation:
  • Include diverse examples with relevant metadata
  • Validate JSONL syntax before evaluation
  • Keep solution_str concise but complete
Performance Optimization:
  • Process datasets in batches for cost efficiency
Cost Management:
  • Start with small samples to test rubrics (see the sketch after this list)
  • Monitor API usage through provider dashboards
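For the small-sample workflow mentioned under Cost Management, a pilot dataset can be carved out of a larger file before committing to a full run. The sample size and file names here are placeholders.

import itertools

SAMPLE_SIZE = 10  # placeholder: number of records to include in the pilot run

# Copy the first few records into a separate file for a cheap trial evaluation.
with open("responses.jsonl", encoding="utf-8") as src, \
        open("responses_sample.jsonl", "w", encoding="utf-8") as dst:
    dst.writelines(itertools.islice(src, SAMPLE_SIZE))

Run osmosis eval against the sample first, then against the full dataset once the rubric behaves as expected.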

Troubleshooting

Debug Mode:
export OSMOSIS_DEBUG=1
osmosis eval --rubric helpfulness --data responses.jsonl
Verify Installation:
pip show osmosis-ai
Test Setup:
osmosis preview --path rubric_configs.yaml
Check Results:
ls -lh ~/.cache/osmosis/eval_result/

Next Steps