CLI Reference - Osmosis

The osmosis-ai CLI provides commands for rubric evaluation and remote rollout server management.

Installation

pip install osmosis-ai

The CLI is available under three aliases:

osmosis
osmosis-ai
osmosis_ai

Global Usage

osmosis [command] [options]

Version

osmosis --version
osmosis -V

Prints the installed SDK version (for example, osmosis-ai 0.2.13).

Authentication Commands

login

Authenticate with the Osmosis AI platform.

Usage

osmosis login

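This command opens a browser for OAuth authentication. After a successful login, your CLI session is authenticated and the credentials are stored locally.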

logout

Log out and revoke the CLI token.

Usage

osmosis logout

whoami

Show the currently authenticated user and workspace.

Usage

osmosis whoami

Output

Logged in as: user@example.com
Workspace: My Workspace (ws_abc123)

workspace

Manage workspaces.

Usage

osmosis workspace [subcommand]

Subcommands

Subcommand	Description
`list`	List all accessible workspaces
`current`	Show the currently active workspace
`switch <workspace_id>`	Switch to another workspace

Examples

# List all workspaces
osmosis workspace list

# Show current workspace
osmosis workspace current

# Switch workspace
osmosis workspace switch ws_abc123

Evaluation Commands

preview

Inspect and validate a rubric configuration or dataset file.

Usage

osmosis preview --path <file_path>

Options

Option	Type	Required	Description
`--path`	string	Yes	Path of the file to preview (YAML or JSONL)

Examples

Preview a rubric configuration:

osmosis preview --path rubric_configs.yaml

Preview a dataset:

osmosis preview --path sample_data.jsonl

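Output

The command will:

Validate the file structure
Display the parsed content in a readable format
Show a count summary (number of rubrics or records)
Report any validation errors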

eval-rubric

Evaluate JSONL conversations against a rubric using a remote LLM provider.

Usage

osmosis eval-rubric --rubric <rubric_id> --data <data_path> [options]

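Required Options

Option	Short	Type	Description
`--rubric`	`-r`	string	Rubric ID from the configuration file
`--data`	`-d`	string	Path to the JSONL dataset file

Optional Parameters

Option	Short	Type	Default	Description
`--config`	`-c`	string	auto-discovered	Path to the rubric configuration YAML
`--number`	`-n`	integer	1	Number of evaluation runs per record
`--output`	`-o`	string	`~/.cache/osmosis/...`	Output path for the results JSON
`--baseline`	`-b`	string	None	Path to a baseline evaluation to compare against

Examples

Basic evaluation: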

osmosis eval-rubric --rubric helpfulness --data responses.jsonl

Multiple runs for variance analysis:

osmosis eval-rubric --rubric helpfulness --data responses.jsonl --number 5

Custom output location:

osmosis eval-rubric --rubric helpfulness --data responses.jsonl --output ./results/eval_001.json

Comparison against a baseline:

osmosis eval-rubric --rubric helpfulness --data new_responses.jsonl --baseline ./results/baseline.json

Custom configuration file:

osmosis eval-rubric --rubric helpfulness --data responses.jsonl --config ./configs/custom_rubrics.yaml

Configuration Files

Rubric Configuration (YAML)

The rubric configuration file defines the evaluation criteria and model settings.

Structure

version: 1
default_score_min: 0.0
default_score_max: 1.0

rubrics:
  - id: rubric_identifier
    title: Human-Readable Title
    rubric: |
      Your evaluation criteria here.
      Can be multiple lines.
    model_info:
      provider: openai
      model: gpt-5.2
      api_key_env: OPENAI_API_KEY
      timeout: 30
    score_min: 0.0  # Optional override
    score_max: 1.0  # Optional override

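Required Fields

version: configuration schema version (currently 1)
rubrics: list of rubric definitions

Rubric Definition Fields

Field	Type	Required	Description
`id`	string	Yes	Unique identifier for the rubric
`title`	string	Yes	Human-readable title
`rubric`	string	Yes	Evaluation criteria described in natural language
`model_info`	object	Yes	LLM provider configuration
`score_min`	float	No	Minimum score (overrides the default)
`score_max`	float	No	Maximum score (overrides the default)

Model Info Fields

Field	Type	Required	Description
`provider`	string	Yes	Provider name (see Supported Providers)
`model`	string	Yes	Model identifier
`api_key_env`	string	No	Name of the environment variable holding the API key
`timeout`	integer	No	Request timeout in seconds (default: 30)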
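The required fields in the tables above can be checked before an evaluation is started. The sketch below is illustrative only: `check_rubric_config` is a hypothetical helper, not part of osmosis-ai, and it assumes the YAML file has already been parsed into a Python dictionary by a YAML library of your choice. The duplicate-ID check is an extra assumption based on `id` being described as unique.

```python
from typing import Any

# Required keys, taken from the field tables above.
REQUIRED_TOP_LEVEL = ("version", "rubrics")
REQUIRED_RUBRIC = ("id", "title", "rubric", "model_info")
REQUIRED_MODEL_INFO = ("provider", "model")


def check_rubric_config(config: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in config:
            problems.append(f"missing top-level field '{key}'")

    seen_ids: set[str] = set()
    for index, rubric in enumerate(config.get("rubrics", [])):
        label = rubric.get("id", f"rubrics[{index}]")
        for key in REQUIRED_RUBRIC:
            if key not in rubric:
                problems.append(f"{label}: missing field '{key}'")
        # Only inspect model_info when it is present, to avoid double reporting.
        if "model_info" in rubric:
            for key in REQUIRED_MODEL_INFO:
                if key not in rubric["model_info"]:
                    problems.append(f"{label}: missing model_info field '{key}'")
        # Assumption: rubric IDs must be unique within one file.
        if "id" in rubric:
            if rubric["id"] in seen_ids:
                problems.append(f"{label}: duplicate rubric id")
            seen_ids.add(rubric["id"])
    return problems
```

This only checks for the presence of fields; `osmosis preview --path rubric_configs.yaml` remains the authoritative validation.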

Auto-Discovery

If --config is not specified, the CLI searches for a file named rubric_configs.yaml.

Dataset Format (JSONL)

Each line of the JSONL file represents one evaluation record.

Minimal Example

{"solution_str": "The AI's response text to evaluate"}

Full Example

{
  "conversation_id": "ticket-12345",
  "rubric_id": "helpfulness",
  "original_input": "How do I reset my password?",
  "solution_str": "Click 'Forgot Password' on the login page and follow the email instructions.",
  "ground_truth": "Users should use the password reset link sent to their registered email.",
  "metadata": {
    "customer_tier": "premium",
    "category": "account_management"
  },
  "score_min": 0.0,
  "score_max": 10.0
}

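Field Reference

Field	Type	Required	Description
`solution_str`	string	Yes	Text to evaluate (must not be empty)
`conversation_id`	string	No	Unique identifier for this record
`rubric_id`	string	No	Associates the record with a specific rubric in the configuration
`original_input`	string	No	Original user query or prompt, used as context
`ground_truth`	string	No	Reference answer to compare against
`metadata`	object	No	Additional context passed to the evaluator
`extra_info`	object	No	Runtime configuration options
`score_min`	float	No	Overrides the minimum score for this record
`score_max`	float	No	Overrides the maximum score for this record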
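A dataset in this format can be produced from Python with the standard library alone. The following is a minimal sketch, not an osmosis-ai API: `to_jsonl` is a hypothetical helper that serializes each record onto its own line and enforces the one required field from the table above. Rejecting whitespace-only `solution_str` values is a stricter assumption than the documented "must not be empty".

```python
import json


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    lines = []
    for number, record in enumerate(records, start=1):
        solution = record.get("solution_str")
        # Assumption: whitespace-only text is treated as empty here.
        if not isinstance(solution, str) or not solution.strip():
            raise ValueError(
                f"record {number}: 'solution_str' must be a non-empty string"
            )
        # json.dumps never emits a raw newline, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

Writing the returned string to a file such as `sample_data.jsonl` gives a dataset that can then be checked with `osmosis preview --path sample_data.jsonl`.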

Output Format

Console Output

During evaluation you will see:

Evaluating records: 100%|████████████████| 50/50 [00:45<00:00, 1.1record/s]

Overall Statistics:
  Average Score: 0.847
  Min Score: 0.200
  Max Score: 1.000
  Variance: 0.034
  Std Deviation: 0.185
  Success Rate: 100.0% (50/50)

Evaluation Results:
  ...

Results saved to:
  ~/.cache/osmosis/eval_result/helpfulness/rubric_eval_result_20250114_143022.json

JSON Output File

The output JSON file contains the detailed results:

{
  "rubric_id": "helpfulness",
  "timestamp": "2025-01-14T14:30:22.123456",
  "duration_seconds": 45.2,
  "total_records": 50,
  "successful_evaluations": 50,
  "failed_evaluations": 0,
  "statistics": {
    "average": 0.847,
    "min": 0.200,
    "max": 1.000,
    "variance": 0.034,
    "std_dev": 0.185
  },
  "results": [
    {
      "conversation_id": "ticket-12345",
      "runs": [
        {
          "score": 0.85,
          "explanation": "The response directly addresses the question...",
          "raw": {...},
          "duration_ms": 890
        }
      ],
      "aggregate_stats": {
        "average": 0.85,
        "variance": 0.0
      }
    }
  ],
  "model_info": {
    "provider": "openai",
    "model": "gpt-5.2"
  }
}
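Because the results are plain JSON, they can be post-processed with the standard library. The sketch below assumes only the keys visible in the sample above (`results`, `runs`, `score`, `conversation_id`); `per_record_averages` is a hypothetical helper, and the `record-N` fallback key is an illustrative choice for records without a `conversation_id`.

```python
def per_record_averages(result: dict) -> dict[str, float]:
    """Map each record's conversation_id to the mean score of its runs."""
    averages: dict[str, float] = {}
    for index, record in enumerate(result.get("results", [])):
        scores = [run["score"] for run in record.get("runs", []) if "score" in run]
        if not scores:
            # Skip records with no scored runs (for example, failed evaluations).
            continue
        key = record.get("conversation_id", f"record-{index}")
        averages[key] = sum(scores) / len(scores)
    return averages
```

Load the file with `json.load` and pass the resulting dictionary to the function.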

Advanced Usage

Baseline Comparison

Compare a new evaluation against a baseline to detect regressions:

# Create baseline
osmosis eval-rubric --rubric helpfulness --data baseline.jsonl --output baseline.json

# Compare new data against baseline
osmosis eval-rubric --rubric helpfulness --data new_data.jsonl --baseline baseline.json

The output will include delta statistics showing improvement or regression.
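The same kind of comparison can be reproduced by hand from two result files. This is a sketch under the assumption that both files carry a `statistics` object like the one in the JSON output sample; `compare_statistics` is a hypothetical helper, and its output format is not the one the CLI prints.

```python
def compare_statistics(baseline: dict, candidate: dict) -> dict[str, float]:
    """Return candidate minus baseline for every statistic present in both."""
    base_stats = baseline.get("statistics", {})
    new_stats = candidate.get("statistics", {})
    return {
        name: new_stats[name] - base_stats[name]
        for name in base_stats
        if name in new_stats
    }
```

A positive delta on `average` indicates improvement, and a negative one indicates regression.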

Variance Analysis

Run multiple evaluations per record to measure score consistency:

osmosis eval-rubric --rubric helpfulness --data responses.jsonl --number 10

Useful for:

Understanding the stability of a rubric
Detecting ambiguous evaluation criteria
A/B testing different prompts
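To make the consistency measure concrete, the sketch below computes the mean, variance, and standard deviation of one record's run scores. It uses the population variance, which is an assumption: the documentation does not state which formula the CLI applies, although a variance of 0.0 for a single run (as in the sample JSON output) is consistent with it. `score_spread` is a hypothetical helper.

```python
import math
import statistics


def score_spread(scores: list[float]) -> tuple[float, float, float]:
    """Return (mean, population variance, population standard deviation)."""
    if not scores:
        raise ValueError("at least one score is required")
    mean = statistics.fmean(scores)
    # Population variance divides by n, so a single run yields 0.0.
    variance = statistics.pvariance(scores, mu=mean)
    return mean, variance, math.sqrt(variance)
```

A record whose standard deviation is large relative to the score range is a candidate for a rubric that the judge model interprets inconsistently.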

Batch Processing

Process multiple datasets:

for file in data/*.jsonl; do
  osmosis eval-rubric --rubric helpfulness --data "$file" --output "results/$(basename "$file" .jsonl).json"
done
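The same loop can be driven from Python, which sidesteps shell quoting for file names that contain spaces. This sketch only builds the argument lists, using the flags documented above; `build_eval_commands` is a hypothetical helper, and whether the CLI creates a missing output directory is not covered here.

```python
from pathlib import Path


def build_eval_commands(
    data_files: list[str], results_dir: str, rubric: str = "helpfulness"
) -> list[list[str]]:
    """Build one `osmosis eval-rubric` argument list per dataset file."""
    commands = []
    for data_file in data_files:
        # results/<dataset name without .jsonl>.json, as in the shell loop.
        output = Path(results_dir) / f"{Path(data_file).stem}.json"
        commands.append(
            [
                "osmosis", "eval-rubric",
                "--rubric", rubric,
                "--data", data_file,
                "--output", str(output),
            ]
        )
    return commands
```

Each list can be passed to `subprocess.run(command, check=True)` so that a failing evaluation stops the batch.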

Custom Cache Location

Override the default cache directory:

export OSMOSIS_CACHE_DIR=/path/to/custom/cache
osmosis eval-rubric --rubric helpfulness --data responses.jsonl

Error Handling

Common Errors

API Key Not Found

Error: API key not found for provider 'openai'

Solution: Set the environment variable:

export OPENAI_API_KEY="your-key-here"

Rubric Not Found

Error: Rubric 'helpfulness' not found in configuration

Solution: Check your rubric_configs.yaml and make sure the rubric ID matches exactly.

Invalid JSONL Format

Error: Invalid JSON on line 5

Solution: Validate your JSONL file. Every line must be valid JSON.
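A quick way to locate the offending line is to parse the file line by line. The sketch below is a standalone check, not the CLI's own validator: `find_invalid_lines` is a hypothetical helper, and skipping blank lines and requiring each line to be a JSON object are assumptions.

```python
import json


def find_invalid_lines(text: str) -> list[tuple[int, str]]:
    """Return (line number, reason) for every line that is not a JSON object."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            # Assumption: blank lines (such as a trailing newline) are ignored.
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, error.msg))
            continue
        if not isinstance(value, dict):
            problems.append((number, "not a JSON object"))
    return problems
```

Pass it the file contents, for example `find_invalid_lines(open("responses.jsonl", encoding="utf-8").read())`.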

Model Not Found

Error: Model 'gpt-5.2' not available for provider 'openai'

Solution: Use a valid model identifier for your chosen provider.

Timeout Error

Error: Request timed out after 30 seconds

Solution: Increase the timeout in the model configuration:

model_info:
  timeout: 60

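Remote Rollout Commands

The CLI also provides commands for running and testing remote rollout servers. For the complete guide, see the Remote Rollout documentation.

serve

Start a RolloutServer for an agent loop implementation.

Usage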

osmosis serve -m <module:attribute> [options]

Required Options

Option	Short	Type	Description
`--module`	`-m`	string	Module path of the agent loop (for example, `my_agent:agent_loop`)
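The `module:attribute` notation is a common Python convention for pointing a tool at an object inside an importable module. How the Osmosis CLI resolves it internally is not documented here; the sketch below shows the usual approach with `importlib`, and `resolve_target` is a hypothetical helper.

```python
import importlib


def resolve_target(spec: str):
    """Import `module` and return `attribute` from a 'module:attribute' string."""
    module_name, separator, attribute = spec.partition(":")
    if not separator or not module_name or not attribute:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    target = importlib.import_module(module_name)
    # Walk dotted attributes such as 'Class.member' one step at a time.
    for part in attribute.split("."):
        target = getattr(target, part)
    return target
```

Under this convention, `my_agent:agent_loop` refers to the object named `agent_loop` in a module `my_agent` that must be importable from where the command is run.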

Optional Parameters

Option	Short	Type	Default	Description
`--port`	`-p`	integer	`9000`	Port to bind
`--host`	`-H`	string	`0.0.0.0`	Host to bind
`--no-validate`		flag	`false`	Skip agent loop validation
`--reload`		flag	`false`	Enable auto-reload for development
`--log-level`		string	`info`	Uvicorn log level (debug/info/warning/error/critical)
`--log`		string		Write logs to the specified directory
`--api-key`		string	auto-generated	API key for TrainGate authentication
`--local`		flag	`false`	Local debug mode (disables authentication and registration)
`--skip-register`		flag	`false`	Skip registration with the Osmosis platform

Examples

# Start server with validation (default port 9000)
osmosis serve -m server:agent_loop

# Specify port
osmosis serve -m server:agent_loop -p 8080

# Write logs to a directory
osmosis serve -m server:agent_loop --log ./rollout_logs

# Enable auto-reload for development
osmosis serve -m server:agent_loop --reload

# Local debug mode (no auth)
osmosis serve -m server:agent_loop --local

validate

Validate a RolloutAgentLoop implementation without starting a server.

Usage

osmosis validate -m <module:attribute> [options]

Options

Option	Short	Type	Description
`--module`	`-m`	string	Module path of the agent loop
`--verbose`	`-v`	flag	Show detailed validation output

Examples

# Validate agent loop
osmosis validate -m server:agent_loop

# Verbose output
osmosis validate -m server:agent_loop -v

test

Test a RolloutAgentLoop against a dataset using a cloud LLM provider.

Usage

osmosis test -m <module:attribute> -d <dataset> [options]

Required Options

Option	Short	Type	Description
`--module`	`-m`	string	Module path of the agent loop
`--dataset`	`-d`	string	Path to the dataset file (.json, .jsonl, .parquet)

Optional Parameters

Option	Short	Type	Default	Description
`--model`		string	`gpt-5-mini`	Model name (for example, `anthropic/claude-sonnet-4-5`)
`--max-turns`		integer	`10`	Maximum agent turns per row
`--temperature`		float		LLM sampling temperature
`--max-tokens`		integer		Maximum tokens per completion
`--limit`		integer	all	Maximum number of rows to test
`--offset`		integer	`0`	Number of rows to skip
`--output`	`-o`	string		Output JSON file for the results
`--quiet`	`-q`	flag	`false`	Suppress progress output
`--debug`		flag	`false`	Enable debug output
`--interactive`	`-i`	flag	`false`	Enable interactive mode
`--row`		integer		Initial row for interactive mode
`--api-key`		string		API key for the LLM provider
`--base-url`		string		Base URL for an OpenAI-compatible API

Examples

# Batch test with default model
osmosis test -m server:agent_loop -d multiply.parquet

# Use Claude
osmosis test -m server:agent_loop -d data.jsonl --model anthropic/claude-sonnet-4-5

# Test subset of data
osmosis test -m server:agent_loop -d data.jsonl --limit 10

# Save results
osmosis test -m server:agent_loop -d data.jsonl -o results.json

# Interactive debugging
osmosis test -m server:agent_loop -d data.jsonl --interactive

Interactive Mode Commands

Command	Description
`n`	Execute the next LLM call
`c`	Continue to completion
`m`	Show the message history
`t`	Show the available tools
`q`	Quit the session

eval

Evaluate a trained model using custom evaluation functions and pass@k metrics.

Usage

osmosis eval -m <module:attribute> -d <dataset> --model <model> --eval-fn <module:fn> [options]

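Required Options

Option	Short	Type	Description
`--module`	`-m`	string	Module path of the agent loop
`--dataset`	`-d`	string	Path to the dataset file (.json, .jsonl, .parquet)
`--model`		string	Model name at the serving endpoint, or a LiteLLM-format name for baseline comparison
`--eval-fn`		string	Evaluation function in `module:function` format (may be specified multiple times)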

Optional Parameters

Option	Short	Type	Default	Description
`--n`		integer	`1`	Number of runs per row (for pass@k)
`--pass-threshold`		float	`1.0`	A score >= the threshold counts as a pass
`--max-turns`		integer	`10`	Maximum agent turns per run
`--temperature`		float		LLM sampling temperature
`--max-tokens`		integer		Maximum tokens per completion
`--batch-size`		integer	`1`	Number of concurrent runs
`--limit`		integer	all	Maximum number of rows to evaluate
`--offset`		integer	`0`	Number of rows to skip
`--output`	`-o`	string		Output JSON file for the results
`--quiet`	`-q`	flag	`false`	Suppress progress output
`--debug`		flag	`false`	Enable debug logging
`--api-key`		string		API key for the model endpoint or LLM provider
`--base-url`		string		Base URL for an OpenAI-compatible API
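To illustrate how `--n` and `--pass-threshold` relate to pass@k, the sketch below counts passing runs for one row and applies the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of runs and c the number that passed. Which estimator the CLI uses is not stated in this reference, so treat the formula as an assumption; `count_passes` and `pass_at_k` are hypothetical helpers.

```python
from math import comb


def count_passes(scores: list[float], threshold: float = 1.0) -> int:
    """Count runs whose score is >= threshold (the --pass-threshold rule)."""
    return sum(1 for score in scores if score >= threshold)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled runs passes,
    given that c of n recorded runs passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k runs include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `--n 5` and two passing runs, `pass_at_k(5, 2, 1)` gives 0.4, the plain pass rate.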

Examples

# Benchmark trained model at an endpoint
osmosis eval -m server:agent_loop -d data.jsonl \
    --eval-fn rewards:compute_reward \
    --model my-finetuned-model --base-url http://localhost:8000/v1

# Multiple eval functions
osmosis eval -m server:agent_loop -d data.jsonl \
    --eval-fn rewards:exact_match \
    --eval-fn rewards:semantic_similarity \
    --model my-finetuned-model --base-url http://localhost:8000/v1

# pass@5 analysis
osmosis eval -m server:agent_loop -d data.jsonl \
    --eval-fn rewards:compute_reward --n 5 \
    --model my-finetuned-model --base-url http://localhost:8000/v1

# Baseline comparison with external LLM
osmosis eval -m server:agent_loop -d data.jsonl \
    --eval-fn rewards:compute_reward --model openai/gpt-5-mini -o results.json

# Concurrent execution
osmosis eval -m server:agent_loop -d data.jsonl \
    --eval-fn rewards:compute_reward --batch-size 5 \
    --model my-finetuned-model --base-url http://localhost:8000/v1

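For the complete guide to evaluation mode, see the Evaluation Mode documentation.

Next Steps

API Reference

Complete API documentation

Remote Rollout

Build custom agent loops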
