AgentBench for OpenClaw

Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.

Commands

When the user says any of these, follow the corresponding instructions:

- /benchmark: Run the full benchmark suite (all 40 tasks)
- /benchmark --fast: Run only easy and medium tasks (19 tasks)
- /benchmark --suite <name>: Run one domain only
- /benchmark --task <id>: Run a single task
- /benchmark --strict: Tag results as externally verified scoring
- /benchmark-list: List all tasks grouped by domain
- /benchmark-results: Show results from previous runs
- /benchmark-compare: Compare two runs side-by-side

Flags are combinable: /benchmark --fast --suite research

Running a Benchmark

Step 1: Discover Tasks

Read task.yaml files from the tasks/ directory in this skill:

    tasks/{suite-name}/{task-name}/task.yaml

Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, and scoring weights.

Filter by --suite or --task if specified. If --fast is set and --task is not, filter to only tasks where difficulty is "easy" or "medium".

Profile is "fast" if --fast was specified, otherwise "full".

List discovered tasks with count and suites.
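The filtering rules above can be sketched as follows. This is a minimal illustration, assuming each task.yaml has already been parsed into a dict with `id`, `suite`, and `difficulty` keys; the function names `select_tasks` and `profile_name` are hypothetical and not part of the skill.

```python
from typing import Optional

FAST_DIFFICULTIES = {"easy", "medium"}


def select_tasks(tasks: list[dict], suite: Optional[str] = None,
                 task_id: Optional[str] = None, fast: bool = False) -> list[dict]:
    """Apply the --suite, --task, and --fast filters to parsed task dicts."""
    selected = tasks
    if suite is not None:
        selected = [t for t in selected if t["suite"] == suite]
    if task_id is not None:
        selected = [t for t in selected if t["id"] == task_id]
    # --fast only narrows the set when no single task was requested.
    if fast and task_id is None:
        selected = [t for t in selected if t["difficulty"] in FAST_DIFFICULTIES]
    return selected


def profile_name(fast: bool) -> str:
    """The profile is "fast" when --fast was given, otherwise "full"."""
    return "fast" if fast else "full"
```

Note that in this sketch `--task` takes precedence over `--fast`, so a hard task can still be run on its own even when `--fast` is present.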

Step 2: Set Up Run Directory

Generate a run ID from the current timestamp: YYYYMMDD-HHmmss

Read suite_version from skill.json in this skill directory.

Create the results directory:
    agentbench-results/{run-id}/

Announce: Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}
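A minimal sketch of the run setup, assuming the run ID uses a YYYYMMDD-HHmmss timestamp and that results live under an agentbench-results/ directory relative to some base path. The helper names `make_run_id` and `create_run_dir` are illustrative only.

```python
from datetime import datetime
from pathlib import Path


def make_run_id(now: datetime) -> str:
    """Format a timestamp as YYYYMMDD-HHmmss (assumed run ID format)."""
    return now.strftime("%Y%m%d-%H%M%S")


def create_run_dir(base: Path, run_id: str) -> Path:
    """Create agentbench-results/{run-id}/ under base and return its path."""
    run_dir = base / "agentbench-results" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

In practice the caller would pass `datetime.now()`; taking the timestamp as a parameter keeps the function easy to check with a fixed value.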

Step 3: Execute Each Task

For each task:

1. Set up workspace:

- Create /tmp/agentbench-task-{task-id}/ as workspace
- Copy input files from tasks/{suite}/{task}/inputs/ to the workspace (if inputs/ exists)
- If the task directory contains a setup.sh: run bash tasks/{suite}/{task}/setup.sh {workspace-path}
- For file-unchanged validators: compute checksums of specified files after setup, before task execution
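The checksum step for file-unchanged validators could look like the sketch below. The skill does not name a hash algorithm, so SHA-256 is an assumption here, as are the helper names and the choice to treat a missing file as changed.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_digest(path: Path) -> Optional[str]:
    """SHA-256 hex digest of a file, or None if the file does not exist."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_checksums(workspace: Path, relative_paths: list[str]) -> dict:
    """Record digests after setup and before the task runs."""
    return {rel: file_digest(workspace / rel) for rel in relative_paths}


def unchanged_files(before: dict, workspace: Path) -> dict:
    """Map each path to True if its digest still matches the snapshot."""
    return {
        rel: digest is not None and file_digest(workspace / rel) == digest
        for rel, digest in before.items()
    }
```

Taking the snapshot after setup.sh runs matters: setup may itself create or modify the protected files, and only changes made during task execution should count against the task.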

2. Announce: Running: {task.name} [{task.suite}] (difficulty: {task.difficulty})

3. Record start time (milliseconds): date +%s%3N

4. Execute the task yourself directly:

- Read the task's user_message and execute it as if a real user sent you the request
- Work ONLY within the workspace directory
- If input files are listed, read them from the workspace
- Execute naturally: use the appropriate tools (read, write, edit, exec, web_search, web_fetch, etc.)
- Create any output files in the workspace directory
- When done, write a brief execution-trace.md to the workspace:
  - What you understood the task to be
  - What approach you took
  - What files you created or modified
  - Any difficulties or decisions you made

5. Record end time and compute duration

6. Collect metrics:

- total_time_ms: end - start
- tool_calls_total: count how many tool calls you made during this task
- errors: count any tool call failures
- planning_ratio: estimate the fraction of time spent reading/thinking vs producing output (approximate is fine)
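As a rough illustration, the four metrics can be gathered into one record like this. The field names come from the list above; `now_ms` and `build_metrics` are hypothetical helpers, and the tool-call and error counts are assumed to be tallied by the agent as it works.

```python
import time


def now_ms() -> int:
    """Current wall-clock time in whole milliseconds."""
    return time.time_ns() // 1_000_000


def build_metrics(start_ms: int, end_ms: int, tool_calls_total: int,
                  errors: int, planning_ratio: float) -> dict:
    """Assemble the per-task metrics record."""
    return {
        "total_time_ms": end_ms - start_ms,
        "tool_calls_total": tool_calls_total,
        "errors": errors,
        "planning_ratio": planning_ratio,
    }
```

A typical use is to call `now_ms()` immediately before and after executing the task and pass both values to `build_metrics`.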

7. Layer 0 — Automated Structural Checks (compute directly):

After task execution, check the workspace. For each entry in expected_outputs:

- file-exists: Check if file exists. 30 points if found, 0 if not.
- content-contains: Read file, check each required section keyword (case-insensitive). Points proportional to matches found. Pool: 40 points.
- word-count-range: Count words. In range = 30 points. Within 2x range = 15 points. Outside = 0.
- git-log-contains: Check git log for expected strings. 30 points if all found, proportional otherwise.
- directory-structure: Check all paths exist. 30 points if all present, proportional for partial.
- command-output-contains: Run command, check output contains all strings. 30 points if match, 0 if not.
- file-unchanged: Compare checksum against pre-execution checksum. 30 points if unchanged, 0 if modified.
- link-consistency: Scan files for link syntax consistency. 30 points if consistent, 15 if mostly consistent (>70% one style), 0 if mixed.

Normalize the total to 0-100.
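Three of the Layer 0 validators and the normalization step can be sketched as below. Two points are interpretations rather than something the text specifies: "within 2x range" is read as a count between half the minimum and twice the maximum, and normalization is taken to be points earned divided by points possible. Each validator returns an (earned, possible) pair; all names are illustrative.

```python
def score_file_exists(exists: bool) -> tuple:
    """30 points if the file was found, 0 otherwise."""
    return (30.0 if exists else 0.0, 30.0)


def score_content_contains(text: str, keywords: list) -> tuple:
    """Share of a 40-point pool, proportional to case-insensitive matches."""
    if not keywords:
        return (40.0, 40.0)
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return (40.0 * hits / len(keywords), 40.0)


def score_word_count(text: str, low: int, high: int) -> tuple:
    """30 in range, 15 within the assumed 2x band, 0 outside."""
    count = len(text.split())
    if low <= count <= high:
        return (30.0, 30.0)
    if low / 2 <= count <= high * 2:
        return (15.0, 30.0)
    return (0.0, 30.0)


def normalize(results: list) -> float:
    """Scale earned points to 0-100 against the points available."""
    earned = sum(e for e, _ in results)
    possible = sum(p for _, p in results)
    return 100.0 * earned / possible if possible else 0.0
```

For example, a task with a found file (30/30), two of four keywords matched (20/40), and an in-range word count (30/30) earns 80 of 100 available points, so its Layer 0 score is 80.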

8. Layer 1 — Metrics Analysis (compute directly):

If the task has expected_metrics:

- Tool calls within expected range: 40 points
- Tool calls within 2x range: 20 points
- Outside 2x range: 0 points
- Planning ratio within expected range: 30 points
- Planning ratio outside but within 2x: 15 points
- Way off: 0 points
- Zero errors: 30 points
- 1-2 errors: 15 points
- 3+ errors: 0 points
- Normalize to 0-100. If no metrics available, score as 50.
- Token estimate is tracked for reporting but NOT scored.
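A sketch of the Layer 1 rubric follows. It assumes expected_metrics provides a (min, max) pair for tool calls and for planning ratio under the hypothetical keys `tool_calls` and `planning_ratio`, and it reads "within 2x range" as between half the minimum and twice the maximum. The three components already sum to at most 100, so no further scaling is applied here.

```python
from typing import Optional


def band_score(value: float, low: float, high: float, full: int, half: int) -> int:
    """Full points inside [low, high], half inside the assumed 2x band."""
    if low <= value <= high:
        return full
    if low / 2 <= value <= high * 2:
        return half
    return 0


def error_score(errors: int) -> int:
    """30 for zero errors, 15 for one or two, 0 for three or more."""
    if errors == 0:
        return 30
    if errors <= 2:
        return 15
    return 0


def layer1_score(tool_calls: int, planning_ratio: float, errors: int,
                 expected: Optional[dict] = None) -> int:
    """Score the metrics; a task without expected metrics scores 50."""
    if not expected:
        return 50
    total = band_score(tool_calls, *expected["tool_calls"], full=40, half=20)
    total += band_score(planning_ratio, *expected["planning_ratio"], full=30, half=15)
    total += error_score(errors)
    return total
```

With an expected tool-call range of 3 to 8 and a planning ratio range of 0.2 to 0.4, a run with 12 tool calls, a ratio of 0.5, and one error would earn 20 + 15 + 15 = 50.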

9. Layer 2 — Behavioral Analysis (self-evaluate honestly, 0-100):

Score based on HOW you executed:

Instruction Adherence (30 points):
- 30: Followed all instructions precisely
- 20: Mostly followed, minor deviations
- 10: Significant deviations
- 0: Ignored or misunderstood

Tool Appropriateness (25 points) — rule-based first:
- Penalty: -10 for each use of exec cat instead of read to read files
- Penalty: -10 for each use of exec echo/printf instead of write to create files
- Penalty: -5 for each use of exec sed/awk instead of edit for file edits
- Start at 25, apply penalties, floor at 0

Approach Quality (25 points) — check read-before-write:
- 25: Read all inputs before producing output
- 15: Read most inputs, minor gaps
- 5: Started producing output without reading context
- 0: No clear approach

Error Recovery (20 points):
- 20: Clean recovery or no errors occurred
- 10: Partial recovery
- 0: Failed to recover
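The rule-based part of Layer 2 is mechanical enough to sketch. The function below applies the Tool Appropriateness penalties exactly as listed, and `layer2_score` simply sums the four components; both names are illustrative, and the other three components are still self-assessed values passed in by the caller.

```python
def tool_appropriateness(cat_uses: int, echo_uses: int, sed_awk_uses: int) -> int:
    """Start at 25, subtract penalties per misuse, floor at 0."""
    penalty = 10 * cat_uses + 10 * echo_uses + 5 * sed_awk_uses
    return max(0, 25 - penalty)


def layer2_score(adherence: int, tool: int, approach: int, recovery: int) -> int:
    """Sum the four components (maxima: 30 + 25 + 25 + 20 = 100)."""
    return adherence + tool + approach + recovery
```

One `exec cat` plus one `exec sed` costs 15 points, leaving 10 of 25 for Tool Appropriateness; three or more `exec cat` calls alone would exhaust the component.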

10. Layer 3 — Output Quality (self-evaluate honestly, 0-100):

Score the deliverable:

Completeness (25): All requirements met? Gaps?
Accuracy (25): Content correct? Calculations right?
Formatting (25): Well-structured? Correct file format?
Polish (25): Would a user be satisfied?

11. Compute composite score:

    score = (L0 × 0.20) + (L1 × 0.35) + (L2 × 0.20) + (L3 × 0.25)

Use weights from task.yaml if specified, otherwise these defaults.
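The composite formula is a plain weighted sum. In this sketch the weights dict uses the hypothetical keys `l0` to `l3`; how task.yaml actually names its weight fields is not shown in this document.

```python
from typing import Optional

DEFAULT_WEIGHTS = {"l0": 0.20, "l1": 0.35, "l2": 0.20, "l3": 0.25}


def composite_score(layers: dict, weights: Optional[dict] = None) -> float:
    """Weighted sum of the four layer scores, each on a 0-100 scale."""
    w = weights or DEFAULT_WEIGHTS
    return sum(layers[key] * w[key] for key in DEFAULT_WEIGHTS)
```

As a worked example with the default weights, layer scores of 80, 60, 70, and 90 give 16 + 21 + 14 + 22.5 = 73.5.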

12. Save task result to agentbench-results/{run-id}/{task-id}/:

- scores.json: All layer scores, composite, breakdown, notes
- metrics.json: Timing, tool calls, errors, planning ratio
- Copy output files

13. Display: {task.name}: {composite}/100 (L0: {l0} L1: {l1} L2: {l2} L3: {l3})

Step 4: Generate Report

After all tasks:

1. Compute domain averages (group by suite, average composite scores)
2. Compute overall score (average of domain scores, so every domain is weighted equally)
3. Compute aggregate metrics
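Equal domain weighting means the overall score is an average of averages, which can differ from the plain mean over all tasks. A minimal sketch, assuming each task result is a dict with `suite` and `composite` keys (hypothetical field names):

```python
from collections import defaultdict
from statistics import mean


def domain_averages(results: list) -> dict:
    """Average the composite scores within each suite."""
    by_suite = defaultdict(list)
    for result in results:
        by_suite[result["suite"]].append(result["composite"])
    return {suite: mean(scores) for suite, scores in by_suite.items()}


def overall_score(domains: dict) -> float:
    """Average the domain scores so each domain counts equally."""
    return mean(domains.values())
```

For instance, two tasks in one suite scoring 80 and 60 and a single task in another suite scoring 90 give domain averages of 70 and 90 and an overall score of 80, whereas the per-task mean would be about 76.7.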

Generate three files in agentbench-results/{run-id}/:

results.json — Machine-readable with this structure:
    {
      "run_id": "20260222-143022",
      "timestamp": "2026-02-22T14:30:22Z",
      "platform": "openclaw",
      ...
    }

If --strict was used, set scoring_method to "externally-verified".

Integrity signature: After building results.json (without signature field), compute:

    SIG=$(echo -n "$CONTENT" | openssl dgst -sha256 -hmac "agentbench-v1-{run_id}-{suite_version}-integrity" | awk '{print $2}')

Add as "signature" field to results.json.
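For reference, the same HMAC-SHA256 can be computed without shelling out. This sketch mirrors the openssl command above using Python's standard `hmac` module; the function name is illustrative, and `content` is assumed to be the exact serialized text of results.json without the signature field, since the serialization details are not specified here.

```python
import hashlib
import hmac


def integrity_signature(content: str, run_id: str, suite_version: str) -> str:
    """Hex HMAC-SHA256 of content, keyed by the run ID and suite version."""
    key = f"agentbench-v1-{run_id}-{suite_version}-integrity".encode("utf-8")
    return hmac.new(key, content.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the signature covers the serialized bytes, any change in whitespace or key order before signing produces a different digest, so the content should be signed exactly as it will be compared.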

report.md — Markdown summary: Overall Score, Metrics, Domain Breakdown, Task Details, Top Failures, Recommendations.

report.html — Self-contained HTML dashboard (inline CSS/JS, no external deps):

- Score display with color (green 80+, yellow 60-79, red <60)
- Domain cards with score bars
- Task detail table (sortable, expandable)
- Top failures section
- Dark mode via prefers-color-scheme
- Footer: "Generated by AgentBench v1.0.0 (OpenClaw) | Suite v{suite_version} | Profile: {profile}"
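The color thresholds for the score display reduce to a small mapping. In this sketch, fractional scores between 79 and 80 are treated as yellow, which is an assumption; the function name is illustrative.

```python
def score_color(score: float) -> str:
    """Map a 0-100 score to the dashboard color band."""
    if score >= 80:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"
```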

Step 5: Present Results

1. Display overall score
2. Show domain breakdown
3. Tell user where results are saved
4. Mention they can submit to https://www.agentbench.app/submit

Step 6: Clean Up

Run teardown.sh if present. Remove temp workspace directories unless --keep-workspace was specified.

Listing Tasks (`/benchmark-list`)

Read all task.yaml files, group them by suite, and display the tasks under each suite.

Viewing Results (`/benchmark-results`)

List all directories in agentbench-results/, show run ID, date, overall score, profile, and task count for each.

Comparing Runs (`/benchmark-compare`)

Show two runs side-by-side: overall scores, domain scores, and per-task deltas. Warn if profiles differ.
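A comparison along these lines could be sketched as follows. The run summary shape used here (`profile`, `overall`, `domains`, `tasks`) is hypothetical and chosen only for illustration; deltas are computed as second run minus first run, and only entries present in both runs are compared.

```python
def deltas(first: dict, second: dict) -> dict:
    """Score change (second minus first) for keys present in both runs."""
    shared = sorted(set(first) & set(second))
    return {key: second[key] - first[key] for key in shared}


def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Build overall, per-domain, and per-task deltas plus any warnings."""
    warnings = []
    if run_a["profile"] != run_b["profile"]:
        warnings.append(
            f"Profiles differ ({run_a['profile']} vs {run_b['profile']}); "
            "scores may not be directly comparable."
        )
    return {
        "overall_delta": run_b["overall"] - run_a["overall"],
        "domain_deltas": deltas(run_a["domains"], run_b["domains"]),
        "task_deltas": deltas(run_a["tasks"], run_b["tasks"]),
        "warnings": warnings,
    }
```

The profile warning matters because a fast run covers only easy and medium tasks, so its scores are drawn from a different task set than a full run.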

Key Differences from Claude Code Version

- No hooks: metrics are self-tracked (timing, tool call counting)
- No subagents: you execute tasks directly in sequence
- Same tasks, same scoring, same output format, so results are cross-platform comparable
- Same integrity signature, so submissions work on the same leaderboard

Important Notes

- Be honest in self-evaluation (L2/L3). Inflated scores are obvious on the leaderboard.
- The objective layers (L0 + L1) carry 55% of the weight, and they can't be faked.
- Token estimates are informational only, not scored.
- Any link syntax is accepted in skill graph tasks; consistency is what's scored.
