Auto Arena Skill
End-to-end automated model comparison using the OpenJudge AutoArenaPipeline:
1. Generate queries: an LLM creates diverse test queries from the task description
2. Collect responses: query all target endpoints concurrently
3. Generate rubrics: an LLM produces evaluation criteria from the task and sample queries
4. Pairwise evaluation: the judge model compares every model pair (with position-bias swap)
5. Analyze & rank: compute win rates, win matrix, and rankings
6. Report & charts: Markdown report, win-rate bar chart, and optional matrix heatmap
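Because the pairwise step compares every model pair in both orders, judge cost grows quadratically with the number of endpoints. The sketch below gives a rough estimate under the assumption of one judge call per query, per pair, per ordering; the pipeline may batch or cache differently, and `count_judge_calls` is a hypothetical helper, not part of OpenJudge.

```python
from itertools import combinations


def count_judge_calls(models, num_queries=20, both_orders=True):
    """Estimate judge calls: unordered model pairs x orderings x queries."""
    pairs = list(combinations(models, 2))  # every unordered pair of models
    orders = 2 if both_orders else 1       # original + swapped presentation
    return len(pairs) * orders * num_queries


if __name__ == "__main__":
    print(count_judge_calls(["model_v1", "model_v2", "model_v3"]))
```

Under these assumptions, adding a fourth endpoint doubles the number of pairs from 3 to 6, which is worth keeping in mind when choosing the number of queries.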
Prerequisites
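```bash
# Install OpenJudge
pip install py-openjudge
# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```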
Gather from user before running
| Info | Required? | Notes |
|---|---|---|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. `gpt-4`, `qwen-max`) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| Number of queries | No | Default: `20` |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: `./evaluation_results` |
| Report language | No | `"zh"` (default) or `"en"` |
Quick start
CLI
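```bash
# Run evaluation
python -m cookbooks.auto_arena --config config.yaml --save
# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml \
  --queries_file queries.json --save
# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save
# Re-run only the pairwise evaluation with a new judge model
# (keeps queries, responses, and rubrics)
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save
```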
Python API
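```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

async def main():
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()
    print(f"Best model: {result.best_pipeline}")
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")

asyncio.run(main())
```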
Minimal Python API (no config file)
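```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline
from cookbooks.auto_arena.schema import OpenAIEndpoint

async def main():
    pipeline = AutoArenaPipeline(
        task_description="E-commerce customer service chatbot",
        target_endpoints={
            "gpt4": OpenAIEndpoint(
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                model="gpt-4",
            ),
            "qwen": OpenAIEndpoint(
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="sk-...",
                model="qwen-max",
            ),
        },
        judge_endpoint=OpenAIEndpoint(
            base_url="https://api.openai.com/v1",
            api_key="sk-...",
            model="gpt-4",
        ),
        num_queries=20,
    )
    result = await pipeline.evaluate()
    print(f"Best: {result.best_pipeline}")

asyncio.run(main())
```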
CLI options
| Flag | Default | Description |
|---|---|---|
| `--config` | - | Path to YAML configuration file (required) |
| `--output_dir` | config value | Override output directory |
| `--queries_file` | - | Path to pre-generated queries JSON (skip generation) |
| `--save` | `False` | Save results to file |
| `--fresh` | `False` | Start fresh, ignore checkpoint |
| `--rerun-judge` | `False` | Re-run pairwise evaluation only (keep queries/responses/rubrics) |
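To make the flag defaults concrete, here is a minimal `argparse` sketch that mirrors the documented flags. It is an illustration only: how the real `cookbooks.auto_arena` entry point parses its arguments is not shown in this document, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Mirror the documented CLI flags; boolean flags default to False."""
    parser = argparse.ArgumentParser(prog="auto_arena")
    parser.add_argument("--config", required=True, help="Path to YAML configuration file")
    parser.add_argument("--output_dir", default=None, help="Override output directory from config")
    parser.add_argument("--queries_file", default=None, help="Pre-generated queries JSON (skips generation)")
    parser.add_argument("--save", action="store_true", help="Save results to file")
    parser.add_argument("--fresh", action="store_true", help="Start fresh, ignore checkpoint")
    parser.add_argument("--rerun-judge", action="store_true", help="Re-run pairwise evaluation only")
    return parser


if __name__ == "__main__":
    # argparse exposes --rerun-judge as the attribute rerun_judge.
    args = build_parser().parse_args(["--config", "config.yaml", "--rerun-judge", "--save"])
    print(args.config, args.rerun_judge, args.save, args.fresh)
```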
Minimal config file
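```yaml
task:
  description: "Academic GPT assistant for research and writing tasks"
target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"
judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
```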
Full config reference
task
| Field | Required | Description |
|---|---|---|
| `description` | Yes | Clear description of the task the models will be tested on |
| `scenario` | No | Usage scenario for additional context |
target_endpoints.<name>
| Field | Default | Description |
|---|---|---|
| `base_url` | - | API base URL (required) |
| `api_key` | - | API key, supports `${ENV_VAR}` (required) |
| `model` | - | Model name (required) |
| `system_prompt` | - | System prompt for this endpoint |
| `extra_params` | - | Extra API params (e.g. `temperature`, `max_tokens`) |
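The `${ENV_VAR}` syntax accepted by `api_key` can be pictured as a substitution pass over config values before an endpoint is built. The following is a minimal sketch that assumes a plain regex substitution; `expand_env` is a hypothetical helper, and the pipeline's own resolution logic may differ, for example in how it treats unset variables.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value):
    """Replace each ${NAME} in value with the content of os.environ[NAME].

    Raises KeyError for unset variables so that a missing key fails early
    instead of being sent to the API as a literal "${...}" string.
    """

    def _lookup(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]

    return _ENV_PATTERN.sub(_lookup, value)


if __name__ == "__main__":
    os.environ["ARENA_DEMO_KEY"] = "sk-demo"  # demo value, not a real key
    print(expand_env("${ARENA_DEMO_KEY}"))
```

Keeping keys in the environment rather than in the YAML file means the config can be committed or shared without leaking credentials.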
judge_endpoint
Same fields as target_endpoints.<name>. Use a strong model (e.g. gpt-4, qwen-max) with low temperature (~0.1) for consistent judgments.
query_generation
| Field | Default | Description |
|---|---|---|
| `num_queries` | `20` | Total number of queries to generate |
| `seed_queries` | - | Example queries to guide generation |
| `categories` | - | Query categories with weights for stratified generation |
| `endpoint` | judge endpoint | Custom endpoint for query generation |
| `queries_per_call` | `10` | Queries generated per API call (1–50) |
| `num_parallel_batches` | `3` | Parallel generation batches |
| `temperature` | `0.9` | Sampling temperature (0.0–2.0) |
| `top_p` | `0.95` | Top-p sampling (0.0–1.0) |
| `max_similarity` | `0.85` | Dedup similarity threshold (0.0–1.0) |
| `enable_evolution` | `false` | Enable Evol-Instruct complexity evolution |
| `evolution_rounds` | `1` | Evolution rounds (0–3) |
| `complexity_levels` | `["constraints", "reasoning", "edge_cases"]` | Evolution strategies |
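`max_similarity` reads as a deduplication cutoff: a newly generated query is dropped when it is too close to one that has already been kept. The sketch below illustrates the idea with `difflib.SequenceMatcher` from the standard library. The similarity metric is an assumption chosen for illustration (the pipeline may use a different measure), and `dedup_queries` is a hypothetical name.

```python
from difflib import SequenceMatcher


def dedup_queries(queries, max_similarity=0.85):
    """Keep a query only if it is not too similar to any query already kept."""
    kept = []
    for query in queries:
        normalized = query.strip().lower()
        is_duplicate = any(
            SequenceMatcher(None, normalized, other.strip().lower()).ratio() > max_similarity
            for other in kept
        )
        if not is_duplicate:
            kept.append(query)
    return kept


if __name__ == "__main__":
    candidates = [
        "How do I reset my password?",
        "How do I reset my password??",  # near-duplicate of the first query
        "What is your refund policy?",
    ]
    print(dedup_queries(candidates))
```

A lower threshold removes more near-duplicates at the cost of discarding queries that are merely related, so it trades diversity against coverage.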
evaluation
| Field | Default | Description |
|---|---|---|
| `max_concurrency` | `10` | Max concurrent API requests |
| `timeout` | `60` | Request timeout in seconds |
| `retry_times` | `3` | Retry attempts for failed requests |
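The three evaluation settings interact: a concurrency cap bounds in-flight requests, the timeout bounds each attempt, and retries repeat failed attempts. A minimal `asyncio` sketch of that interaction follows. It assumes `retry_times` counts retries after the first attempt, which the table does not specify, and `call_with_limits` is a hypothetical helper rather than the pipeline's implementation.

```python
import asyncio


async def call_with_limits(make_request, semaphore, timeout=60, retry_times=3):
    """Run make_request() under a concurrency cap, with a timeout and retries.

    make_request is any zero-argument callable returning an awaitable; it
    stands in for a real endpoint call.
    """
    last_error = None
    for _ in range(1 + retry_times):  # first attempt plus retry_times retries
        async with semaphore:  # at most max_concurrency requests in flight
            try:
                return await asyncio.wait_for(make_request(), timeout=timeout)
            except (asyncio.TimeoutError, ConnectionError) as exc:
                last_error = exc
    raise last_error


async def demo():
    """Simulate a request that fails twice and succeeds on the third attempt."""
    semaphore = asyncio.Semaphore(10)  # evaluation.max_concurrency
    attempts = {"count": 0}

    async def flaky_request():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    result = await call_with_limits(flaky_request, semaphore, timeout=60, retry_times=3)
    return result, attempts["count"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A production version would usually add a backoff delay between attempts; it is left out here to keep the sketch short.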
output
| Field | Default | Description |
|---|---|---|
| `output_dir` | `./evaluation_results` | Output directory |
| `save_queries` | `true` | Save generated queries |
| `save_responses` | `true` | Save model responses |
| `save_details` | `true` | Save detailed results |
report
| Field | Default | Description |
|---|---|---|
| `enabled` | `false` | Enable Markdown report generation |
| `language` | `"zh"` | Report language: `"zh"` or `"en"` |
| `include_examples` | `3` | Examples per section (1–10) |
| `chart.enabled` | `true` | Generate win-rate chart |
| `chart.orientation` | `"horizontal"` | `"horizontal"` or `"vertical"` |
| `chart.show_values` | `true` | Show values on bars |
| `chart.highlight_best` | `true` | Highlight best model |
| `chart.matrix_enabled` | `false` | Generate win-rate matrix heatmap |
| `chart.format` | `"png"` | Chart format: `"png"`, `"svg"`, or `"pdf"` |
Interpreting results
Win rate: percentage of pairwise comparisons a model wins. Each pair is evaluated in both orders (original and swapped) so that a judge's preference for a particular position does not systematically favor either model.
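The both-orders evaluation can be pictured as follows. This is an illustrative sketch, not the pipeline's code: `judge` is a hypothetical callable that returns `"first"` or `"second"` for the response it prefers, and counting each ordering as a separate comparison is an assumption.

```python
def compare_both_orders(judge, query, response_a, response_b):
    """Return (wins_a, wins_b) after judging the pair in both presentation orders."""
    wins_a = wins_b = 0
    # Original order: A is shown first.
    if judge(query, response_a, response_b) == "first":
        wins_a += 1
    else:
        wins_b += 1
    # Swapped order: B is shown first, so "first" now means B.
    if judge(query, response_b, response_a) == "first":
        wins_b += 1
    else:
        wins_a += 1
    return wins_a, wins_b


def always_first(query, first, second):
    """A maximally position-biased judge: it always prefers the first slot."""
    return "first"


def prefers_longer(query, first, second):
    """A position-independent judge: it prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    print(compare_both_orders(always_first, "q", "answer a", "answer b"))
    print(compare_both_orders(prefers_longer, "q", "a longer answer", "short"))
```

With the fully position-biased judge the two orderings cancel out to a tie, while a judge that responds to content gives the same model both wins.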
Rankings example:
CODEBLOCK5
Win matrix: win_matrix[A][B] = how often model A beats model B across all queries.
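One way to turn raw comparison outcomes into a win matrix, win rates, and rankings is sketched below. The input format (a list of `(winner, loser)` tuples) and the name `summarize` are assumptions for illustration, and `win_matrix[A][B]` is computed as the fraction of A-versus-B comparisons that A won, which is one plausible reading of "how often".

```python
from collections import defaultdict


def summarize(outcomes):
    """Build a win matrix, per-model win rates, and a ranking.

    outcomes: iterable of (winner, loser) pairs, one per pairwise comparison.
    """
    wins = defaultdict(lambda: defaultdict(int))
    models = set()
    for winner, loser in outcomes:
        wins[winner][loser] += 1
        models.update((winner, loser))

    win_matrix = {}
    win_rates = {}
    for a in models:
        win_matrix[a] = {}
        total_wins = total_games = 0
        for b in models:
            if a == b:
                continue
            games = wins[a][b] + wins[b][a]
            win_matrix[a][b] = wins[a][b] / games if games else 0.0
            total_wins += wins[a][b]
            total_games += games
        win_rates[a] = total_wins / total_games if total_games else 0.0

    # Highest win rate first, as (model, win_rate) tuples.
    rankings = sorted(win_rates.items(), key=lambda item: item[1], reverse=True)
    return win_matrix, win_rates, rankings


if __name__ == "__main__":
    matrix, rates, ranking = summarize([("A", "B"), ("A", "B"), ("A", "B"), ("B", "A")])
    print(matrix, rates, ranking)
```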
Checkpoint & resume
The pipeline saves progress after each step. Interrupted runs resume automatically:
- `--fresh`: ignore checkpoint, start from scratch
- `--rerun-judge`: re-run only the pairwise evaluation step (useful when switching judge models); keeps queries, responses, and rubrics intact
- Adding new endpoints to config triggers incremental response collection; existing responses are preserved
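Step-level checkpointing of this kind can be sketched with a single JSON file that records each finished step. The file name, the step names, and `run_steps` are hypothetical; the pipeline's actual checkpoint format is not described in this document.

```python
import json
import tempfile
from pathlib import Path


def run_steps(steps, checkpoint_path, fresh=False):
    """Run named steps in order, skipping any already recorded in the checkpoint.

    steps: list of (name, zero-argument function returning JSON-serializable data).
    Returns (state, executed), where executed lists the steps run in this call.
    """
    path = Path(checkpoint_path)
    state = {}
    if path.exists() and not fresh:
        state = json.loads(path.read_text(encoding="utf-8"))

    executed = []
    for name, func in steps:
        if name in state:
            continue  # finished in an earlier run
        state[name] = func()
        executed.append(name)
        # Persist after every step so an interruption loses at most one step.
        path.write_text(json.dumps(state), encoding="utf-8")
    return state, executed


def demo():
    """Simulate an interrupted run, a resumed run, and a fresh run."""
    steps = [("queries", lambda: ["q1", "q2"]), ("responses", lambda: {"q1": "r1"})]
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "checkpoint.json"
        _, first_run = run_steps(steps[:1], checkpoint)  # stops after step 1
        _, resumed = run_steps(steps, checkpoint)  # picks up at step 2
        _, restarted = run_steps(steps, checkpoint, fresh=True)  # ignores checkpoint
    return first_run, resumed, restarted


if __name__ == "__main__":
    print(demo())
```

Re-running only the judge step would correspond to deleting that one step's entry from the saved state before resuming.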
Output files
CODEBLOCK6
API key by model
| Model prefix | Environment variable |
|---|---|
| INLINECODE94, `o1-*`, INLINECODE96 | INLINECODE97 |
| INLINECODE98 | `ANTHROPIC_API_KEY` |
| `qwen-*`, `dashscope/*` | `DASHSCOPE_API_KEY` |
| `deepseek-*` | `DEEPSEEK_API_KEY` |
| Custom endpoint | set `api_key` + `base_url` in config |
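A prefix table like this maps naturally onto glob-style matching. The sketch below covers only the prefix and variable pairs that are fully spelled out in the table; `env_var_for_model` is a hypothetical helper and not an OpenJudge API.

```python
import fnmatch

# Only the pairs that are fully legible in the table; extend as needed.
KEY_ENV_BY_PATTERN = [
    ("qwen-*", "DASHSCOPE_API_KEY"),
    ("dashscope/*", "DASHSCOPE_API_KEY"),
    ("deepseek-*", "DEEPSEEK_API_KEY"),
]


def env_var_for_model(model):
    """Return the env var name for a model, or None for a custom endpoint."""
    for pattern, env_name in KEY_ENV_BY_PATTERN:
        if fnmatch.fnmatchcase(model, pattern):
            return env_name
    return None  # caller falls back to api_key + base_url from the config


if __name__ == "__main__":
    for name in ("qwen-max", "deepseek-chat", "my-local-model"):
        print(name, env_var_for_model(name))
```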
Additional resources
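Auto Arena Skill
End-to-end automated model comparison using the OpenJudge AutoArenaPipeline:
1. Generate queries: an LLM creates diverse test queries from the task description
2. Collect responses: query all target endpoints concurrently
3. Generate rubrics: an LLM produces evaluation criteria from the task and sample queries
4. Pairwise evaluation: the judge model compares every model pair (with position-bias swap)
5. Analyze & rank: compute win rates, win matrix, and rankings
6. Report & charts: Markdown report, win-rate bar chart, and optional matrix heatmap
Prerequisites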
```bash
# Install OpenJudge
pip install py-openjudge
# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```
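Gather from user before running
| Info | Required? | Notes |
|---|---|---|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. `gpt-4`, `qwen-max`) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| Number of queries | No | Default: `20` |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: `./evaluation_results` |
| Report language | No | `zh` (default) or `en` |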
Quick start
CLI
```bash
# Run evaluation
python -m cookbooks.auto_arena --config config.yaml --save
# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml \
  --queries_file queries.json --save
# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save
# Re-run only the pairwise evaluation with a new judge model
# (keeps queries, responses, and rubrics)
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save
```
Python API
```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

async def main():
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()
    print(f"Best model: {result.best_pipeline}")
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")

asyncio.run(main())
```
Minimal Python API (no config file)
```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline
from cookbooks.auto_arena.schema import OpenAIEndpoint

async def main():
    pipeline = AutoArenaPipeline(
        task_description="E-commerce customer service chatbot",
        target_endpoints={
            "gpt4": OpenAIEndpoint(
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                model="gpt-4",
            ),
            "qwen": OpenAIEndpoint(
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="sk-...",
                model="qwen-max",
            ),
        },
        judge_endpoint=OpenAIEndpoint(
            base_url="https://api.openai.com/v1",
            api_key="sk-...",
            model="gpt-4",
        ),
        num_queries=20,
    )
    result = await pipeline.evaluate()
    print(f"Best: {result.best_pipeline}")

asyncio.run(main())
```
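CLI options
| Flag | Default | Description |
|---|---|---|
| `--config` | - | Path to YAML configuration file (required) |
| `--output_dir` | config value | Override output directory |
| `--queries_file` | - | Path to pre-generated queries JSON (skip generation) |
| `--save` | `False` | Save results to file |
| `--fresh` | `False` | Start fresh, ignore checkpoint |
| `--rerun-judge` | `False` | Re-run pairwise evaluation only (keep queries/responses/rubrics) |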
Minimal config file
```yaml
task:
  description: "Academic GPT assistant for research and writing tasks"
target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"
judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
```
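Full config reference
task
| Field | Required | Description |
|---|---|---|
| `description` | Yes | Clear description of the task the models will be tested on |
| `scenario` | No | Usage scenario for additional context |
target_endpoints.<name>
| Field | Default | Description |
|---|---|---|
| `base_url` | - | API base URL (required) |
| `api_key` | - | API key, supports `${ENV_VAR}` (required) |
| `model` | - | Model name (required) |
| `system_prompt` | - | System prompt for this endpoint |
| `extra_params` | - | Extra API params (e.g. `temperature`, `max_tokens`) |
judge_endpoint
Same fields as target_endpoints.<name>. Use a strong model (e.g. gpt-4, qwen-max) with a low temperature (~0.1) for consistent judgments.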
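query_generation
| Field | Default | Description |
|---|---|---|
| `num_queries` | `20` | Total number of queries to generate |
| `seed_queries` | - | Example queries to guide generation style |
| `categories` | - | Query categories with weights for stratified generation |
| `endpoint` | judge endpoint | Custom endpoint for query generation |
| `queries_per_call` | `10` | Queries generated per API call (1–50) |
| `num_parallel_batches` | `3` | Parallel generation batches |
| `temperature` | `0.9` | Sampling temperature (0.0–2.0) |
| `top_p` | `0.95` | Top-p sampling (0.0–1.0) |
| `max_similarity` | `0.85` | Dedup similarity threshold (0.0–1.0) |
| `enable_evolution` | `false` | Enable Evol-Instruct complexity evolution |
| `evolution_rounds` | `1` | Evolution rounds (0–3) |
| `complexity_levels` | `[constraints, reasoning, edge_cases]` | Evolution strategies |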
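evaluation
| Field | Default | Description |
|---|---|---|
| `max_concurrency` | `10` | Max concurrent API requests |
| `timeout` | `60` | Request timeout in seconds |
| `retry_times` | `3` | Retry attempts for failed requests |
output
| Field | Default | Description |
|---|---|---|
| `output_dir` | `./evaluation_results` | Output directory |
| `save_queries` | `true` | Save generated queries |
| `save_responses` | `true` | Save model responses |
| `save_details` | `true` | Save detailed results |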
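report
| Field | Default | Description |
|---|---|---|
| `enabled` | `false` | Enable Markdown report generation |
| `language` | `zh` | Report language: `zh` or `en` |
| `include_examples` | `3` | Examples per section (1–10) |
| `chart.enabled` | `true` | Generate win-rate chart |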
| chart.orientation | horizontal |