Installation
Global Usage
Authentication Commands
login
Authenticate with the Osmosis AI platform.

Usage
logout
Log out and revoke the CLI token.

Usage
whoami
Display the currently authenticated user and workspace information.

Usage
Output
workspace
Manage workspaces.

Usage
Subcommands
| Subcommand | Description |
|---|---|
| list | List all accessible workspaces |
| current | Show current active workspace |
| switch <workspace_id> | Switch to a different workspace |
Examples
Evaluation Commands
preview
Inspect and validate rubric configurations or dataset files.

Usage
Options
| Option | Type | Required | Description |
|---|---|---|---|
| --path | string | Yes | Path to the file to preview (YAML or JSONL) |
Examples
Preview a rubric configuration:

Output

The command will:

- Validate the file structure
- Display parsed contents in a readable format
- Show count summary (number of rubrics or records)
- Report any validation errors
eval
Evaluate a dataset against a rubric configuration.

Usage
Required Options
| Option | Short | Type | Description |
|---|---|---|---|
| --rubric | -r | string | Rubric ID from your configuration file |
| --data | -d | string | Path to JSONL dataset file |
Optional Parameters
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --config | -c | string | Auto-discovered | Path to rubric configuration YAML |
| --number | -n | integer | 1 | Number of evaluation runs per record |
| --output | -o | string | ~/.cache/osmosis/... | Output path for results JSON |
| --baseline | -b | string | None | Path to baseline evaluation for comparison |
Examples
Basic evaluation:

Configuration Files
Rubric Configuration (YAML)
The rubric configuration file defines evaluation criteria and model settings.

Structure
Required Fields
- version: Configuration schema version (currently 1)
- rubrics: List of rubric definitions
Rubric Definition Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier for the rubric |
| title | string | Yes | Human-readable title |
| rubric | string | Yes | Evaluation criteria in natural language |
| model_info | object | Yes | LLM provider configuration |
| score_min | float | No | Minimum score (overrides default) |
| score_max | float | No | Maximum score (overrides default) |
Model Info Fields
| Field | Type | Required | Description |
|---|---|---|---|
| provider | string | Yes | Provider name (see Supported Providers) |
| model | string | Yes | Model identifier |
| api_key_env | string | No | Environment variable name for API key |
| timeout | integer | No | Request timeout in seconds (default: 30) |
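Putting these fields together, a minimal rubric_configs.yaml might look like the following sketch. The rubric ID, title, criteria text, and score range are illustrative placeholders; the provider and model values come from the Supported Providers table below.

```yaml
version: 1
rubrics:
  - id: helpfulness-v1            # illustrative ID; must be unique
    title: Response Helpfulness
    rubric: >
      Score the response from 0 to 10 based on how directly and
      accurately it answers the user's question.
    score_min: 0.0
    score_max: 10.0
    model_info:
      provider: openai
      model: gpt-5
      api_key_env: OPENAI_API_KEY
      timeout: 30
```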
Auto-Discovery
If you don’t specify --config, the CLI searches for rubric_configs.yaml in:

- The same directory as the data file
- The current working directory
- An ./examples/ subdirectory
Dataset Format (JSONL)
Each line in the JSONL file represents one evaluation record.

Minimal Example
Complete Example
Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
| solution_str | string | Yes | The text to be evaluated (must be non-empty) |
| conversation_id | string | No | Unique identifier for this record |
| rubric_id | string | No | Links to a specific rubric in config |
| original_input | string | No | Original user query/prompt for context |
| ground_truth | string | No | Reference answer for comparison |
| metadata | object | No | Additional context passed to evaluator |
| extra_info | object | No | Runtime configuration options |
| score_min | float | No | Override minimum score for this record |
| score_max | float | No | Override maximum score for this record |
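As an illustration of the field reference above, a record combining the required field with several optional ones might look like this (one JSON object per line; all values are hypothetical):

```json
{"solution_str": "The capital of France is Paris.", "conversation_id": "conv-001", "rubric_id": "helpfulness-v1", "original_input": "What is the capital of France?", "ground_truth": "Paris", "metadata": {"source": "demo"}, "score_min": 0.0, "score_max": 10.0}
```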
Output Format
Console Output
During evaluation, you’ll see:

JSON Output File

The output JSON file contains detailed results:

Supported Providers
| Provider | Value | API Key Env | Example Models |
|---|---|---|---|
| OpenAI | openai | OPENAI_API_KEY | gpt-5 |
| Anthropic | anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-5 |
| Google Gemini | gemini | GOOGLE_API_KEY | gemini-2.5-flash |
| xAI | xai | XAI_API_KEY | grok-4 |
| OpenRouter | openrouter | OPENROUTER_API_KEY | 100+ models |
| Cerebras | cerebras | CEREBRAS_API_KEY | llama3.1-405b |
Provider Configuration Example
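For example, a model_info block targeting Anthropic might look like this sketch (provider, model, and API key variable taken from the providers table above; the timeout value is illustrative):

```yaml
model_info:
  provider: anthropic
  model: claude-sonnet-4-5
  api_key_env: ANTHROPIC_API_KEY
  timeout: 60
```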
Advanced Usage
Baseline Comparison
Compare new evaluations against a baseline to detect regressions:

Variance Analysis

Run multiple evaluations per record to measure score consistency. This is useful for:

- Understanding rubric stability
- Detecting ambiguous criteria
- A/B testing different prompts
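As a sketch of how repeated-run scores can be summarized, assuming each record's runs yield a list of numeric scores (the exact layout of the results JSON may differ), per-record consistency reduces to a mean and standard deviation:

```python
import statistics

def score_stats(scores):
    """Summarize repeated evaluation scores for one record."""
    mean = statistics.fmean(scores)
    # Population stdev; 0.0 means the rubric scored identically on every run.
    stdev = statistics.pstdev(scores)
    return {"mean": mean, "stdev": stdev}

# Hypothetical scores from three runs (--number 3) of one record.
print(score_stats([8.0, 9.0, 8.5]))
```

A high standard deviation on many records is a hint that the rubric criteria are ambiguous and worth tightening.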
Batch Processing
Process multiple datasets:

Custom Cache Location

Override the default cache directory:

Error Handling
Common Errors
API Key Not Found
Rubric Not Found
Check rubric_configs.yaml and ensure the rubric ID matches exactly.
Invalid JSONL Format
Model Not Found
Timeout Error
Best Practices
Writing Effective Rubrics:

- Be specific and measurable
- Include clear criteria and examples
- Test with sample data before large-scale evaluation
- Include diverse examples with relevant metadata
- Validate JSONL syntax before evaluation
- Keep solution_str concise but complete
- Process datasets in batches for cost efficiency
- Start with small samples to test rubrics
- Monitor API usage through provider dashboards
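To act on the "validate JSONL syntax" tip above, a small pre-flight check like this sketch can catch malformed lines and empty solution_str fields before a costly evaluation run (field names taken from the dataset reference above):

```python
import json

def validate_jsonl(lines):
    """Return (line_number, error) tuples for records that would fail."""
    errors = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((i, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict) or not record.get("solution_str"):
            errors.append((i, "missing or empty 'solution_str'"))
    return errors

# Example: line 2 is broken JSON, line 3 lacks solution_str.
sample = [
    '{"solution_str": "Paris is the capital of France."}',
    '{"solution_str": "truncated...',
    '{"conversation_id": "c-2"}',
]
for lineno, err in validate_jsonl(sample):
    print(lineno, err)
```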
Troubleshooting
Debug Mode:

Remote Rollout Commands

The CLI also provides commands for running and testing remote rollout servers. See the Remote Rollout documentation for the complete guide.

serve

Start a RolloutServer for an agent loop implementation.

Usage
Required Options
| Option | Short | Type | Description |
|---|---|---|---|
| --module | -m | string | Module path to the agent loop (e.g., my_agent:agent_loop) |
Optional Parameters
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --port | -p | integer | 9000 | Port to bind to |
| --host | -H | string | 0.0.0.0 | Host to bind to |
| --no-validate | | flag | false | Skip agent loop validation |
| --reload | | flag | false | Enable auto-reload for development |
| --log-level | | string | info | Uvicorn log level (debug/info/warning/error/critical) |
| --log | | string | | Enable logging to specified directory |
| --api-key | | string | auto-generated | API key for TrainGate authentication |
| --local | | flag | false | Local debug mode (disables auth and registration) |
| --skip-register | | flag | false | Skip Osmosis Platform registration |
Examples
validate
Validate a RolloutAgentLoop implementation without starting the server.

Usage
Options
| Option | Short | Type | Description |
|---|---|---|---|
| --module | -m | string | Module path to the agent loop |
| --verbose | -v | flag | Show detailed validation output |
Examples
test
Test a RolloutAgentLoop against a dataset using cloud LLM providers.

Usage
Required Options
| Option | Short | Type | Description |
|---|---|---|---|
| --module | -m | string | Module path to the agent loop |
| --dataset | -d | string | Path to dataset file (.json, .jsonl, .parquet) |
Optional Parameters
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --model | | string | gpt-4o | Model name (e.g., anthropic/claude-sonnet-4-20250514) |
| --max-turns | | integer | 10 | Max agent turns per row |
| --temperature | | float | | LLM sampling temperature |
| --max-tokens | | integer | | Maximum tokens per completion |
| --limit | | integer | all | Max rows to test |
| --offset | | integer | 0 | Rows to skip |
| --output | -o | string | | Output JSON file for results |
| --quiet | -q | flag | false | Suppress progress output |
| --debug | | flag | false | Enable debug output |
| --interactive | -i | flag | false | Enable interactive mode |
| --row | | integer | | Initial row for interactive mode |
| --api-key | | string | | API key for LLM provider |
| --base-url | | string | | Base URL for OpenAI-compatible APIs |
Examples
Interactive Mode Commands
| Command | Description |
|---|---|
| n | Execute next LLM call |
| c | Continue to completion |
| m | Show message history |
| t | Show available tools |
| q | Quit session |