Test Mode - Osmosis

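Before connecting to the Osmosis training cluster, you can run your agent end to end locally against any cloud LLM provider (OpenAI, Anthropic, etc.) to verify its behavior and debug problems.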

Overview

osmosis test -m <module:agent> -d <dataset> [options]

Test mode:

Runs your agent loop against a dataset
Uses external LLM providers through LiteLLM
Computes rewards from your ground_truth data
Supports batch and interactive execution

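Dataset format

A test dataset uses the following columns:

Column	Required	Description
`system_prompt`	Yes	System prompt for the LLM
`user_prompt`	Yes	User message that starts the conversation
`ground_truth`	No	Expected output used for reward calculation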
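A quick pre-flight check can catch malformed rows before a test run. The following is a minimal sketch, not part of Osmosis: `validate_jsonl` and the constant names are hypothetical, and skipping blank lines is an assumption about how a JSONL file should be read.

```python
import json

# Column names taken from the table above.
REQUIRED_COLUMNS = ("system_prompt", "user_prompt")


def validate_jsonl(text: str) -> list[str]:
    """Return human-readable problems found in JSONL text (hypothetical helper)."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        for column in REQUIRED_COLUMNS:
            if column not in row:
                problems.append(f"line {lineno}: missing required column '{column}'")
    return problems


# In-memory demonstration: one valid row, one missing a column, one broken line.
sample = "\n".join([
    '{"system_prompt": "s", "user_prompt": "u", "ground_truth": "1"}',
    '{"user_prompt": "u"}',
    "not json",
])
for problem in validate_jsonl(sample):
    print(problem)
```

To check a real file, pass its contents to the function, for example `validate_jsonl(open("test_data.jsonl", encoding="utf-8").read())`.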

Supported formats

JSONL (.jsonl)
JSON (.json)
Parquet (.parquet)

Example dataset

test_data.jsonl:

{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 15 * 7?", "ground_truth": "105"}
{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 100 / 4?", "ground_truth": "25"}
{"system_prompt": "You are a calculator. Use tools to solve math problems. Format final answer as: #### <number>", "user_prompt": "What is 8 + 13?", "ground_truth": "21"}

Basic usage

Set API keys

# OpenAI (default)
export OPENAI_API_KEY="your-key"

# Anthropic
export ANTHROPIC_API_KEY="your-key"

Run tests

# Default: GPT-5-mini
osmosis test -m server:agent_loop -d test_data.jsonl

# Use Claude
osmosis test -m server:agent_loop -d test_data.jsonl --model anthropic/claude-sonnet-4-5

# Use custom OpenAI-compatible API
osmosis test -m server:agent_loop -d test_data.jsonl --base-url http://localhost:8000/v1

For the full list of CLI options, see the CLI reference.

Batch mode

Run every row and print a summary:

osmosis test -m server:agent_loop -d test_data.jsonl

Output:

osmosis-rollout-test v0.2.13
Loading agent: server:agent_loop
  Agent name: calculator
Loading dataset: test_data.jsonl
  Total rows: 3
Initializing provider: openai
  Model: gpt-5-mini

Running tests...
[1/3] Row 0: OK (2.1s, 148 tokens)
[2/3] Row 1: OK (1.9s, 152 tokens)
[3/3] Row 2: OK (1.7s, 134 tokens)

Summary:
  Total: 3
  Passed: 3
  Failed: 0
  Duration: 5.7s
  Total tokens: 434

Save results

osmosis test -m server:agent_loop -d test_data.jsonl -o results.json

Output JSON structure:

{
  "summary": {
    "total": 3,
    "passed": 3,
    "failed": 0,
    "total_duration_ms": 5700,
    "total_tokens": 434
  },
  "results": [
    {
      "row_index": 0,
      "success": true,
      "error": null,
      "duration_ms": 2100,
      "token_usage": {"total_tokens": 148},
      "reward": 1.0,
      "finish_reason": "stop"
    }
  ]
}
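Once results are saved, a short script can pull out the numbers you care about. This is a minimal sketch that relies only on the fields shown in the structure above. `summarize` is a hypothetical helper, and treating `reward` and `token_usage` as possibly null is an assumption.

```python
def summarize(report: dict) -> dict:
    """Condense a saved results report into a few headline numbers."""
    results = report.get("results", [])
    # Assumption: reward may be null, e.g. for rows without ground_truth.
    rewards = [r["reward"] for r in results if r.get("reward") is not None]
    failed_rows = [r["row_index"] for r in results if not r.get("success")]
    tokens = sum((r.get("token_usage") or {}).get("total_tokens", 0) for r in results)
    return {
        "rows": len(results),
        "mean_reward": sum(rewards) / len(rewards) if rewards else None,
        "failed_rows": failed_rows,
        "total_tokens": tokens,
    }


# The single result entry from the structure above.
example_report = {
    "summary": {"total": 3, "passed": 3, "failed": 0,
                "total_duration_ms": 5700, "total_tokens": 434},
    "results": [
        {"row_index": 0, "success": True, "error": None, "duration_ms": 2100,
         "token_usage": {"total_tokens": 148}, "reward": 1.0, "finish_reason": "stop"},
    ],
}
print(summarize(example_report))
```

For a real run, load the file first with `json.load(open("results.json", encoding="utf-8"))` and pass the resulting dictionary to `summarize`.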

Test a subset

# First 10 rows
osmosis test -m server:agent_loop -d test_data.jsonl --limit 10

# Skip first 50 rows, test next 20
osmosis test -m server:agent_loop -d test_data.jsonl --offset 50 --limit 20

Interactive mode

Step through the agent turn by turn to debug it:

osmosis test -m server:agent_loop -d test_data.jsonl --interactive

Example interactive session

=== Interactive Mode ===
Dataset: test_data.jsonl (3 rows)
Starting at row 0

--- Row 0 ---
System: You are a calculator...
User: What is 15 * 7?

[turn 0] > n
Calling LLM...
Assistant: I'll calculate 15 * 7 using the multiply tool.
Tool calls: multiply(a=15, b=7)

Executing tools...
Tool result: 105

[turn 1] > n
Calling LLM...
Assistant: The answer is:

#### 105

[turn 1] Rollout complete
  Finish reason: stop
  Reward: 1.0

Continue to next row? (y/n/q) >

For interactive mode commands and other options, see the CLI reference.

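Supported providers

Test mode relies on LiteLLM for provider support. Non-OpenAI providers need a prefix on the model name (for example anthropic/, gemini/, groq/). OpenAI models need no prefix.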
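The naming convention can be pictured as a split on the first slash. The sketch below only illustrates that convention. It does not reproduce how LiteLLM or Osmosis resolve providers internally, and `split_model_name` is a hypothetical helper.

```python
def split_model_name(model: str) -> tuple[str, str]:
    """Split 'provider/model' into its parts; an unprefixed name is treated as OpenAI."""
    provider, separator, name = model.partition("/")
    if not separator:
        return "openai", model  # no prefix: OpenAI, per the convention above
    return provider, name


for model in ("gpt-5-mini", "anthropic/claude-sonnet-4-5"):
    print(model, "->", split_model_name(model))
```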

Custom OpenAI-compatible API

osmosis test -m server:agent_loop -d test_data.jsonl \
  --base-url http://localhost:8000/v1 \
  --model my-local-model

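Tips and best practices

Start with interactive mode

Before running batch tests, use --interactive to get a feel for how the agent behaves.

Test edge cases

Include varied test cases: simple queries, multi-step problems, edge cases, and likely failure scenarios.

Compare models

Test with several models to confirm the agent works across providers, and use token usage to estimate training cost.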
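If you save one results file per model, the `summary` blocks can be compared side by side. The following is a minimal sketch under that assumption. `compare_runs` is a hypothetical helper, and the numbers for the second model are made up for illustration, not measurements.

```python
def compare_runs(runs: dict[str, dict]) -> list[dict]:
    """Return per-model pass rate and tokens per row, cheapest model first."""
    rows = []
    for model, report in runs.items():
        summary = report["summary"]
        total = summary["total"]
        rows.append({
            "model": model,
            "pass_rate": summary["passed"] / total if total else 0.0,
            "tokens_per_row": summary["total_tokens"] / total if total else 0.0,
        })
    return sorted(rows, key=lambda row: row["tokens_per_row"])


runs = {
    # Summary values from the batch example earlier on this page.
    "gpt-5-mini": {"summary": {"total": 3, "passed": 3, "failed": 0, "total_tokens": 434}},
    # Illustrative, made-up values for a second model.
    "anthropic/claude-sonnet-4-5": {"summary": {"total": 3, "passed": 2, "failed": 1, "total_tokens": 600}},
}
for row in compare_runs(runs):
    print(f"{row['model']}: pass_rate={row['pass_rate']:.2f}, tokens/row={row['tokens_per_row']:.1f}")
```

Tokens per row is only a rough proxy for cost, since providers price input and output tokens differently.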

Troubleshooting

If you see “API key not found”, set the corresponding environment variable:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

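If you see “Model not found”, check the model name format. OpenAI models take no prefix (gpt-5.2), while other providers need one (anthropic/claude-sonnet-4-5, gemini/gemini-3-flash-preview).

If you see “No rows to test”, verify that every line of the dataset is valid JSON (for JSONL), that the required columns are present, and that --offset does not skip every row.

Enable debug mode for detailed output on tool errors: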

osmosis test -m server:agent_loop -d test_data.jsonl --debug

Next steps

Evaluation mode

Evaluate trained models with pass@k metrics

Agent loop guide

Learn advanced agent patterns

CLI reference

Complete CLI documentation
