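Eval Mode

After training, use eval mode to benchmark your model: it runs your RolloutAgentLoop against a dataset and reports custom eval function scores, statistical analysis, and pass@k metrics.

Overview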

osmosis eval -m <module:agent> -d <dataset> --model <model> --eval-fn <module:fn> [options]

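Eval mode:

Benchmarks trained models served at OpenAI-compatible endpoints
Scores agent outputs with custom eval functions
Supports pass@k analysis across multiple runs
Runs requests concurrently via --batch-size to speed up benchmarks
Can reuse existing @osmosis_reward functions as eval functions

osmosis eval evaluates agent performance with custom eval functions. To evaluate JSONL conversations against a rubric, use osmosis eval-rubric instead.

Quick Start

The examples below are based on the osmosis-remote-rollout-example repository, which you can clone and run directly: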

git clone https://github.com/Osmosis-AI/osmosis-remote-rollout-example.git
cd osmosis-remote-rollout-example && uv sync

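Benchmark a Trained Model

Connect to your model serving endpoint and evaluate:

# Evaluate a trained model served at an endpoint
# Replace <your-model> with the model name registered at your serving endpoint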
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward \
    --model <your-model> \
    --base-url http://localhost:8000/v1

# Use multiple eval functions
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:exact_match \
    --eval-fn rewards:partial_match \
    --model <your-model> \
    --base-url http://localhost:8000/v1

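Compare Against Baselines with LiteLLM

Use LiteLLM-format model names (provider/model) to benchmark against external LLM providers. The commands below run as-is in the example repository once you set an API key: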

export OPENAI_API_KEY="your-key"

# Compare against GPT-5-mini as a baseline
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --model openai/gpt-5-mini

# Compare against Claude
export ANTHROPIC_API_KEY="your-key"
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --model anthropic/claude-sonnet-4-5

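Dataset Format

Eval mode uses the same dataset format as test mode:

Column	Required	Description
`system_prompt`	Yes	System prompt for the LLM
`user_prompt`	Yes	User message that starts the conversation
`ground_truth`	No	Expected output, used by eval functions for scoring

Additional columns are passed to eval functions via metadata (full mode) or extra_info (simple mode). Supported formats: .jsonl, .json, .parquet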
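As a minimal sketch of the format above, the following script writes a small JSONL dataset with the two required columns, the optional ground_truth column, and one extra column. The file name my_eval_data.jsonl, the helper write_dataset, and the difficulty column are hypothetical names chosen for illustration; they are not part of the osmosis tooling.

```python
import json
from pathlib import Path

REQUIRED_COLUMNS = ("system_prompt", "user_prompt")


def write_dataset(path, rows):
    """Validate the required columns, then write one JSON object per line."""
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {i} is missing required columns: {missing}")
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


rows = [
    {
        "system_prompt": "You are a careful calculator.",
        "user_prompt": "What is 17 * 3?",
        "ground_truth": "51",
        "difficulty": "easy",  # hypothetical extra column
    },
    {
        "system_prompt": "You are a careful calculator.",
        "user_prompt": "What is 144 / 12?",
        "ground_truth": "12",
        "difficulty": "easy",  # hypothetical extra column
    },
]

write_dataset("my_eval_data.jsonl", rows)
```

The resulting file can then be passed to osmosis eval with -d my_eval_data.jsonl. According to the description above, the extra difficulty column would reach the eval function through extra_info or metadata.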

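Eval Functions

Eval functions score agent outputs. Two function signatures are supported, and the mode is detected automatically from the name of the first parameter.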
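To make the detection rule concrete, here is an illustrative sketch of how a first-parameter check could be written with the standard inspect module. This is not the osmosis implementation; detect_mode and the two sample functions are hypothetical and only mirror the rule stated above.

```python
import inspect


def detect_mode(fn) -> str:
    """Classify an eval function by the name of its first parameter."""
    params = list(inspect.signature(fn).parameters)
    first = params[0] if params else None
    if first == "solution_str":
        return "simple"
    if first == "messages":
        return "full"
    raise ValueError(
        f"first parameter must be 'solution_str' or 'messages', got {first!r}"
    )


# Hypothetical eval functions used only to exercise the check.
def simple_fn(solution_str, ground_truth, extra_info=None, **kwargs):
    return 0.0


def full_fn(messages, ground_truth, metadata, **kwargs):
    return 0.0


print(detect_mode(simple_fn), detect_mode(full_fn))
```

A practical consequence of the rule is that renaming the first parameter (for example to answer) would leave the function in neither mode, so it is worth keeping the exact names.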

Simple Mode (compatible with `@osmosis_reward`)

Use this when you only need the final assistant response:

def exact_match(solution_str: str, ground_truth: str, extra_info: dict = None, **kwargs) -> float:
    """Score based on the last assistant message content."""
    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0

The first parameter must be named solution_str
solution_str is extracted from the last assistant message
Compatible with existing @osmosis_reward functions without modification
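Exact string comparison can be stricter than intended when outputs differ only in case or whitespace. The sketch below shows one possible simple-mode variant that normalizes both strings first. The function name normalized_match and the normalization rules are assumptions for illustration; the signature follows the simple-mode requirements listed above.

```python
import re


def normalized_match(solution_str: str, ground_truth: str,
                     extra_info: dict = None, **kwargs) -> float:
    """Return 1.0 when the answers match after light normalization, else 0.0."""
    if ground_truth is None:
        # ground_truth is an optional column, so a row may not provide it.
        return 0.0

    def normalize(text: str) -> str:
        # Trim, lowercase, and collapse runs of whitespace to a single space.
        return re.sub(r"\s+", " ", text.strip().lower())

    return 1.0 if normalize(solution_str) == normalize(ground_truth) else 0.0
```

If saved in rewards.py, it could be selected with --eval-fn rewards:normalized_match, following the module:fn pattern used in the commands above.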

Full Mode

Use this when you need the complete conversation history:

def conversation_quality(messages: list, ground_truth: str, metadata: dict, **kwargs) -> float:
    """Score based on the full conversation."""
    assistant_messages = [m for m in messages if m["role"] == "assistant"]
    return min(1.0, len(assistant_messages) / 3)

The first parameter must be named messages
Receives the complete message list from the agent run
Both sync and async functions are supported
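Since async functions are supported, a full-mode eval function can also be a coroutine. The following is a minimal sketch under two assumptions: messages are OpenAI-style dicts with role and content keys, and the dataset carries a hypothetical extra column max_turns that arrives through metadata. The function name and the 0.5 penalty are illustrative choices, not library behavior.

```python
import asyncio


async def correct_within_budget(messages: list, ground_truth: str,
                                metadata: dict, **kwargs) -> float:
    """1.0 if the final answer contains the ground truth within the turn
    budget, 0.5 if it is correct but over budget, 0.0 otherwise."""
    # Yield to the event loop once; a real function might await an API call here.
    await asyncio.sleep(0)

    assistant = [m for m in messages if m.get("role") == "assistant"]
    if not assistant or not ground_truth:
        return 0.0

    content = assistant[-1].get("content")
    if not isinstance(content, str) or ground_truth.strip() not in content:
        return 0.0

    # max_turns is a hypothetical extra dataset column; default to 3 turns.
    budget = (metadata or {}).get("max_turns", 3)
    return 1.0 if len(assistant) <= budget else 0.5
```

Outside of osmosis eval, the coroutine can be exercised directly with asyncio.run(correct_within_budget(messages, "51", {})).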

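pass@k Analysis

When --n is greater than 1, eval mode runs each dataset row multiple times and computes pass@k metrics. pass@k estimates the probability that, among k samples drawn at random from the n total runs, at least one passes (score >= threshold).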

# pass@k with 5 runs per row
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 \
    --model <your-model> --base-url http://localhost:8000/v1

# Custom pass threshold (default is 1.0)
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 --pass-threshold 0.5 \
    --model <your-model> --base-url http://localhost:8000/v1
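To make the pass@k definition concrete, the sketch below computes it for a single row using the widely used combinatorial estimator 1 - C(n-c, k) / C(n, k), where c is the number of passing runs out of n. Whether osmosis eval uses exactly this formula internally is an assumption; the helper names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that k runs drawn without replacement from n runs
    include at least one of the c passing runs."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the draws that contain only failing runs.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_from_scores(scores, k: int, threshold: float = 1.0) -> float:
    """Count runs with score >= threshold as passes, then apply pass_at_k."""
    passes = sum(1 for s in scores if s >= threshold)
    return pass_at_k(len(scores), passes, k)


# One row, 5 runs, 2 of which reach the default threshold of 1.0.
scores = [1.0, 0.0, 0.5, 1.0, 0.0]
print(pass_at_k_from_scores(scores, k=1))
print(pass_at_k_from_scores(scores, k=3))
print(pass_at_k_from_scores(scores, k=1, threshold=0.5))
```

With two passes in five runs, pass@1 is 2/5 and pass@3 is 1 - 1/10; lowering the threshold to 0.5 counts the 0.5 run as a pass as well, which raises pass@1 to 3/5.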

Concurrent Execution

Use --batch-size to run multiple requests in parallel:

# Run 5 concurrent requests
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward \
    --model <your-model> --base-url http://localhost:8000/v1 \
    --batch-size 5

# Combine with pass@k — 10 runs per row, 5 concurrent
osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 10 --batch-size 5 \
    --model <your-model> --base-url http://localhost:8000/v1

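Model Endpoints

Eval mode works with any OpenAI-compatible serving endpoint, specified via --base-url:

Serving platform	Example `--base-url`
vLLM	`http://localhost:8000/v1`
SGLang	`http://localhost:30000/v1`
Ollama	`http://localhost:11434/v1`
Any OpenAI-compatible API	`http://<host>:<port>/v1`

The --model argument should match the model name registered at the serving endpoint.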
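If you are unsure which name the endpoint has registered, OpenAI-compatible servers generally expose a model listing at GET <base-url>/models. The sketch below queries it with the standard library only. It assumes the endpoint needs no API key and returns the usual {"data": [{"id": ...}]} payload; list_models and parse_model_ids are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def parse_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /models response body."""
    return [m["id"] for m in payload.get("data", [])]


def list_models(base_url: str, timeout: float = 10.0) -> list:
    """Fetch <base_url>/models and return the registered model names."""
    url = f"{base_url.rstrip('/')}/models"
    with urlopen(url, timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


# Example (requires a running endpoint):
#   print(list_models("http://localhost:8000/v1"))
```

Any id returned by this call is a candidate value for --model.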

Output Format

Use -o to save results:

osmosis eval -m server:agent_loop -d test_data.jsonl \
    --eval-fn rewards:compute_reward --n 5 \
    --model <your-model> --base-url http://localhost:8000/v1 \
    -o results.json

{
  "config": {
    "model": "<your-model>",
    "n_runs": 5,
    "pass_threshold": 1.0,
    "eval_fns": ["rewards:compute_reward"]
  },
  "summary": {
    "total_rows": 100,
    "total_runs": 500,
    "eval_fns": {
      "rewards:compute_reward": {
        "mean": 0.72,
        "std": 0.45,
        "min": 0.0,
        "max": 1.0,
        "pass_at_1": 0.72,
        "pass_at_3": 0.94,
        "pass_at_5": 0.98
      }
    },
    "total_tokens": 625000,
    "total_duration_ms": 230500
  },
  "rows": [
    {
      "row_index": 0,
      "runs": [
        {
          "run_index": 0,
          "success": true,
          "scores": {"rewards:compute_reward": 1.0},
          "duration_ms": 450,
          "tokens": 200
        }
      ]
    }
  ]
}
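For post-processing, the saved file can be read with any JSON library. The sketch below assumes the structure shown above (summary.eval_fns, rows[].runs[]) and prints a compact per-function summary plus the runs that did not succeed; summarize and failed_runs are hypothetical helper names.

```python
import json


def summarize(results: dict) -> list:
    """Return one header line plus one line per eval function."""
    summary = results["summary"]
    lines = [f"{summary['total_rows']} rows, {summary['total_runs']} runs"]
    for name, stats in summary["eval_fns"].items():
        # Collect pass_at_<k> keys and order them by k.
        pass_keys = sorted(
            (key for key in stats if key.startswith("pass_at_")),
            key=lambda key: int(key.rsplit("_", 1)[1]),
        )
        passes = ", ".join(
            f"pass@{key.rsplit('_', 1)[1]}={stats[key]:.2f}" for key in pass_keys
        )
        line = f"{name}: mean={stats['mean']:.2f} std={stats['std']:.2f} {passes}"
        lines.append(line.rstrip())
    return lines


def failed_runs(results: dict) -> list:
    """Return (row_index, run_index) pairs for runs where success is false."""
    return [
        (row["row_index"], run["run_index"])
        for row in results["rows"]
        for run in row["runs"]
        if not run["success"]
    ]


# Example usage with a file produced by `-o results.json`:
#   with open("results.json", encoding="utf-8") as f:
#       results = json.load(f)
#   print("\n".join(summarize(results)))
#   print(failed_runs(results))
```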

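For the full list of CLI options, see the CLI Reference.

Tips and Best Practices

Start with --n 1

Verify that your eval functions work correctly before running an expensive pass@k analysis.

Reuse @osmosis_reward functions

Existing reward functions work as eval functions in simple mode without modification.

Use multiple eval functions

Evaluate different quality dimensions in a single run: correctness, efficiency, and format compliance.

Compare against baselines

Run the same benchmark with a LiteLLM provider (for example, --model openai/gpt-5-mini) to establish baseline performance.

Tune --batch-size for throughput

Concurrent execution can significantly reduce total wall-clock time. Start with a moderate value (for example, 5) and increase it gradually according to endpoint capacity.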

Next Steps

Test Mode

CLI Reference

Remote Rollout
