AgentBench for OpenClaw
Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.
Commands
When the user says any of these, follow the corresponding instructions:
- /benchmark: Run the full benchmark suite (all 40 tasks)
- /benchmark --fast: Run only easy and medium tasks (19 tasks)
- /benchmark --suite <name>: Run one domain only
- /benchmark --task <id>: Run a single task
- /benchmark --strict: Tag results as externally verified scoring
- /benchmark-list: List all tasks grouped by domain
- /benchmark-results: Show results from previous runs
- /benchmark-compare: Compare two runs side-by-side
Flags are combinable: /benchmark --fast --suite research
Running a Benchmark
Step 1: Discover Tasks
Read task.yaml files from the tasks/ directory in this skill:
tasks/{suite-name}/{task-name}/task.yaml
Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, and scoring weights.
Filter by --suite or --task if specified. If --fast is set and --task is not, filter to only tasks where difficulty is "easy" or "medium".
Profile is "fast" if --fast was specified, otherwise "full".
List discovered tasks with count and suites.
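The filtering rules above can be sketched as follows. This is a minimal illustration that assumes each task.yaml has already been parsed into a dict; the function name `select_tasks` and the sample `coding` suite are hypothetical, not part of the skill.

```python
def select_tasks(tasks, suite=None, task_id=None, fast=False):
    """Filter parsed task records according to the command flags."""
    selected = list(tasks)
    if suite is not None:
        selected = [t for t in selected if t["suite"] == suite]
    if task_id is not None:
        selected = [t for t in selected if t["id"] == task_id]
    # --fast only narrows the set when no single task was requested.
    if fast and task_id is None:
        selected = [t for t in selected if t["difficulty"] in ("easy", "medium")]
    profile = "fast" if fast else "full"
    return selected, profile


# Hypothetical sample records standing in for parsed task.yaml files.
tasks = [
    {"id": "t1", "suite": "research", "difficulty": "easy"},
    {"id": "t2", "suite": "research", "difficulty": "hard"},
    {"id": "t3", "suite": "coding", "difficulty": "medium"},
]
selected, profile = select_tasks(tasks, suite="research", fast=True)
print(len(selected), profile)  # prints: 1 fast
```

Note that `--task` combined with `--fast` still returns the requested task even if it is hard, while the profile is still reported as "fast".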
Step 2: Set Up Run Directory
Generate a run ID from the current timestamp: YYYYMMDD-HHmmss
Read suite_version from skill.json in this skill directory.
Create the results directory:
agentbench-results/{run-id}/
Announce: Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}
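One possible way to produce the run ID and results directory, assuming the run ID follows a YYYYMMDD-HHmmss layout and that local time is acceptable (the skill does not say whether UTC is required). The helper names `make_run_id` and `create_run_dir` are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def make_run_id(now=None):
    """Format a timestamp as YYYYMMDD-HHmmss (local time by assumption)."""
    now = now or datetime.now()
    return now.strftime("%Y%m%d-%H%M%S")


def create_run_dir(base_dir, run_id):
    """Create agentbench-results/{run-id}/ under base_dir and return its path."""
    run_dir = Path(base_dir) / "agentbench-results" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


print(make_run_id(datetime(2026, 2, 22, 14, 30, 22)))  # prints: 20260222-143022
```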
Step 3: Execute Each Task
For each task:
- 1. Set up workspace:
- Create /tmp/agentbench-task-{task-id}/ as the workspace
- Copy input files from tasks/{suite}/{task}/inputs/ to the workspace (if inputs/ exists)
- If the task directory contains a setup.sh, run bash tasks/{suite}/{task}/setup.sh {workspace-path}
- For file-unchanged validators, compute checksums of the specified files after setup and before task execution
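The checksum step for file-unchanged validators might look like the sketch below. SHA-256 is an assumption (the skill does not name a hash), and `snapshot` and `file_unchanged_points` are hypothetical helper names. A file that disappears is treated as modified.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(workspace, relative_paths):
    """Record checksums after setup and before task execution."""
    root = Path(workspace)
    return {rel: sha256_of(root / rel) for rel in relative_paths}


def file_unchanged_points(workspace, before):
    """Award 30 points per file whose checksum still matches, else 0."""
    root = Path(workspace)
    points = {}
    for rel, digest in before.items():
        target = root / rel
        same = target.is_file() and sha256_of(target) == digest
        points[rel] = 30 if same else 0
    return points
```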
- 2. Announce: Running: {task-name} [{suite}] (difficulty: {difficulty})
- 3. Record start time (milliseconds): date +%s%3N
- 4. Execute the task yourself directly:
- Read the task's user_message and execute it as if a real user sent you the request
- Work ONLY within the workspace directory
- If input files are listed, read them from the workspace
- Execute naturally: use the appropriate tools (read, write, edit, exec, web_search, web_fetch, etc.)
- Create any output files in the workspace directory
- When done, write a brief execution-trace.md to the workspace:
- What you understood the task to be
- What approach you took
- What files you created or modified
- Any difficulties or decisions you made
- 5. Record end time and compute duration
- 6. Collect metrics:
- total_time_ms: end - start
- tool_calls_total: the number of tool calls you made during this task
- errors: the number of tool call failures
- planning_ratio: estimated fraction of time spent reading/thinking versus producing output (an approximation is fine)
- 7. Layer 0 — Automated Structural Checks (compute directly):
After task execution, check the workspace. For each entry in expected_outputs:
- file-exists: Check if the file exists. 30 points if found, 0 if not.
- content-contains: Read the file and check each required section keyword (case-insensitive). Points are proportional to matches found. Pool: 40 points.
- word-count-range: Count words. In range = 30 points. Within 2x range = 15 points. Outside = 0.
- git-log-contains: Check git log for expected strings. 30 points if all found, proportional otherwise.
- directory-structure: Check that all paths exist. 30 points if all present, proportional for partial.
- command-output-contains: Run the command and check that its output contains all strings. 30 points if match, 0 if not.
- file-unchanged: Compare the checksum against the pre-execution checksum. 30 points if unchanged, 0 if modified.
- link-consistency: Scan files for link syntax consistency. 30 points if consistent, 15 if mostly consistent (>70% one style), 0 if mixed.
- Normalize total to 0-100.
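Two of the Layer 0 validators and the normalization step can be sketched as below. Two points are assumptions rather than stated rules: "within 2x range" is read as the interval from half the minimum to twice the maximum, and normalization is taken to be points earned divided by points possible. All function names are hypothetical.

```python
def score_content_contains(text, keywords, pool=40):
    """Case-insensitive keyword check; points proportional to matches."""
    if not keywords:
        return 0.0  # nothing to check (assumed behaviour)
    haystack = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in haystack)
    return pool * hits / len(keywords)


def score_word_count(text, low, high):
    """30 in range, 15 within the assumed 2x band, otherwise 0."""
    count = len(text.split())
    if low <= count <= high:
        return 30
    if low / 2 <= count <= high * 2:
        return 15
    return 0


def normalize_layer0(results):
    """results is a list of (earned, possible) pairs, one per validator."""
    possible = sum(p for _, p in results)
    if possible == 0:
        return 0.0
    return 100 * sum(e for e, _ in results) / possible


text = "Summary\nFindings here. Recommendations follow."
keywords = ["summary", "findings", "risks", "recommendations"]
results = [
    (30, 30),                                        # file-exists passed
    (score_content_contains(text, keywords), 40),    # 3 of 4 keywords
    (score_word_count(text, 8, 20), 30),             # 5 words, inside 2x band
]
print(normalize_layer0(results))  # prints: 75.0
```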
- 8. Layer 1 — Metrics Analysis (compute directly):
If task has expected_metrics:
- Tool calls within expected range: 40 points
- Tool calls within 2x range: 20 points
- Outside 2x range: 0 points
- Planning ratio within expected range: 30 points
- Planning ratio outside but within 2x: 15 points
- Planning ratio outside 2x range: 0 points
- Zero errors: 30 points
- 1-2 errors: 15 points
- 3+ errors: 0 points
- Normalize to 0-100. If no metrics available, score as 50.
- Token estimate is tracked for reporting but NOT scored.
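A sketch of the Layer 1 rubric under stated assumptions: the shape of `expected` (a `tool_calls` and a `planning_ratio` entry, each a `(min, max)` pair) is hypothetical, and "within 2x range" is again read as half the minimum to twice the maximum. Because the three maxima sum to 100, no further normalization is needed here.

```python
def band_score(value, low, high, full, half):
    """Full points in range, half points inside the assumed 2x band."""
    if low <= value <= high:
        return full
    if low / 2 <= value <= high * 2:
        return half
    return 0


def score_layer1(metrics, expected):
    """Return a 0-100 score, or 50 when the task has no expected metrics."""
    if not expected:
        return 50
    tool_low, tool_high = expected["tool_calls"]
    plan_low, plan_high = expected["planning_ratio"]
    tool_pts = band_score(metrics["tool_calls_total"], tool_low, tool_high, 40, 20)
    plan_pts = band_score(metrics["planning_ratio"], plan_low, plan_high, 30, 15)
    errors = metrics["errors"]
    error_pts = 30 if errors == 0 else 15 if errors <= 2 else 0
    return tool_pts + plan_pts + error_pts


metrics = {"tool_calls_total": 12, "planning_ratio": 0.5, "errors": 0}
expected = {"tool_calls": (5, 10), "planning_ratio": (0.2, 0.4)}
print(score_layer1(metrics, expected))  # prints: 65
```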
- 9. Layer 2 — Behavioral Analysis (self-evaluate honestly, 0-100):
Score based on HOW you executed:
Instruction Adherence (30 points):
- 30: Followed all instructions precisely
- 20: Mostly followed, minor deviations
- 10: Significant deviations
- 0: Ignored or misunderstood
Tool Appropriateness (25 points) — rule-based first:
- Penalty: -10 for each use of exec cat instead of read to read files
- Penalty: -10 for each use of exec echo/printf instead of write to create files
- Penalty: -5 for each use of exec sed/awk instead of edit for file edits
- Start at 25, apply penalties, floor at 0
Approach Quality (25 points) — check read-before-write:
- 25: Read all inputs before producing output
- 15: Read most inputs, minor gaps
- 5: Started producing output without reading context
- 0: No clear approach
Error Recovery (20 points):
- 20: Clean recovery or no errors occurred
- 10: Partial recovery
- 0: Failed to recover
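The rule-based part of Layer 2 is simple arithmetic, sketched here with hypothetical parameter names. The other three components remain self-assessed and are passed in as already-chosen point values.

```python
def tool_appropriateness(cat_reads=0, echo_writes=0, sed_edits=0):
    """Start at 25, subtract penalties per misuse, floor at 0."""
    score = 25 - 10 * cat_reads - 10 * echo_writes - 5 * sed_edits
    return max(score, 0)


def score_layer2(instruction_adherence, approach_quality, error_recovery, **misuse):
    """Sum the four components (maxima 30 + 25 + 25 + 20 = 100)."""
    return (
        instruction_adherence
        + tool_appropriateness(**misuse)
        + approach_quality
        + error_recovery
    )


# One `exec cat` and one `exec sed` cost 15 of the 25 tool points.
print(tool_appropriateness(cat_reads=1, sed_edits=1))            # prints: 10
print(score_layer2(20, 25, 20, cat_reads=1, sed_edits=1))        # prints: 75
```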
- 10. Layer 3 — Output Quality (self-evaluate honestly, 0-100):
Score the deliverable:
Completeness (25): All requirements met? Gaps?
Accuracy (25): Content correct? Calculations right?
Formatting (25): Well-structured? Correct file format?
Polish (25): Would a user be satisfied?
- 11. Compute composite score:
score = (L0 × 0.20) + (L1 × 0.35) + (L2 × 0.20) + (L3 × 0.25)
Use weights from task.yaml if specified, otherwise these defaults.
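As a worked example of the composite formula, the sketch below applies the default weights and lets a task override them. The dictionary keys `l0` to `l3` are an assumed representation of the per-task weights.

```python
DEFAULT_WEIGHTS = {"l0": 0.20, "l1": 0.35, "l2": 0.20, "l3": 0.25}


def composite_score(layers, weights=None):
    """Weighted sum of the four layer scores, rounded to two decimals."""
    w = weights or DEFAULT_WEIGHTS
    total = sum(layers[name] * w[name] for name in ("l0", "l1", "l2", "l3"))
    return round(total, 2)


layers = {"l0": 80, "l1": 65, "l2": 70, "l3": 90}
# 80*0.20 + 65*0.35 + 70*0.20 + 90*0.25 = 16 + 22.75 + 14 + 22.5
print(composite_score(layers))  # prints: 75.25
```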
- 12. Save task result to agentbench-results/{run-id}/{task-id}/:
- scores.json: All layer scores, composite, breakdown, notes
- metrics.json: Timing, tool calls, errors, planning ratio
- Copy output files
- 13. Display: {task-name}: {composite}/100 (L0: {l0} L1: {l1} L2: {l2} L3: {l3})
Step 4: Generate Report
After all tasks:
- 1. Compute domain averages (group by suite, average composite scores)
- 2. Compute the overall score (average of domain scores, so every domain carries equal weight)
- 3. Compute aggregate metrics
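Equal domain weighting means the overall score is the mean of the domain means, not the mean of all task scores. A minimal sketch, with a hypothetical `summarize` helper and made-up sample scores:

```python
from collections import defaultdict
from statistics import mean


def summarize(task_scores):
    """task_scores is a list of (suite, composite_score) pairs."""
    by_suite = defaultdict(list)
    for suite, score in task_scores:
        by_suite[suite].append(score)
    domain_scores = {suite: mean(scores) for suite, scores in by_suite.items()}
    overall = mean(list(domain_scores.values()))
    return domain_scores, overall


# Hypothetical results: two research tasks, one coding task.
domain_scores, overall = summarize([("research", 80), ("research", 60), ("coding", 90)])
print(domain_scores["research"], domain_scores["coding"], overall)
```

In this sample the domain means are 70 and 90, so the overall score is 80, whereas a plain average over the three tasks would give about 76.7.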
Generate three files in agentbench-results/{run-id}/:
results.json — Machine-readable with this structure:
{
  "run_id": "20260222-143022",
  "timestamp": "2026-02-22T14:30:22Z",
  "platform": "openclaw",
  "mode": ...
  ...
}
If --strict was used, set scoring_method to "externally-verified".
Integrity signature: After building results.json (without signature field), compute:
SIG=$(echo -n "$CONTENT" | openssl dgst -sha256 -hmac "agentbench-v1-{run_id}-{suite_version}-integrity" | awk '{print $2}')
Add it as the "signature" field of results.json.
report.md — Markdown summary: Overall Score, Metrics, Domain Breakdown, Task Details, Top Failures, Recommendations.
report.html — Self-contained HTML dashboard (inline CSS/JS, no external deps):
- Score display with color (green 80+, yellow 60-79, red <60)
- Domain cards with score bars
- Task detail table (sortable, expandable)
- Top failures section
- Dark mode via prefers-color-scheme
- Footer: "Generated by AgentBench v1.0.0 (OpenClaw) | Suite v{suite_version} | Profile: {profile}"
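The color thresholds for the score display reduce to a small lookup. The sketch below assumes fractional scores between 79 and 80 count as yellow; `score_color` is a hypothetical helper name.

```python
def score_color(score):
    """Map a 0-100 score to the dashboard color band."""
    if score >= 80:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"


print(score_color(75.25))  # prints: yellow
```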
Step 5: Present Results
- 1. Display the overall score
- 2. Show the domain breakdown
- 3. Tell the user where results are saved
- 4. Mention they can submit to https://www.agentbench.app/submit
Step 6: Clean Up
Run teardown.sh if present. Remove temp workspace directories unless --keep-workspace was specified.
Listing Tasks (/benchmark-list)
Read all task.yaml files, group by suite, display as:
CODEBLOCK5
Viewing Results (/benchmark-results)
List all directories in agentbench-results/, show run ID, date, overall score, profile, and task count for each.
Comparing Runs (/benchmark-compare)
Show two runs side-by-side: overall scores, domain scores, and per-task deltas. Warn if profiles differ.
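A comparison could be computed along these lines. The run dictionaries here use hypothetical keys (`profile`, `overall`, `tasks`) chosen for the example, not the actual results.json schema.

```python
def compare_runs(run_a, run_b):
    """Return overall and per-task deltas (b minus a) plus a profile warning."""
    shared = sorted(set(run_a["tasks"]) & set(run_b["tasks"]))
    task_deltas = {
        task_id: round(run_b["tasks"][task_id] - run_a["tasks"][task_id], 2)
        for task_id in shared
    }
    warning = None
    if run_a["profile"] != run_b["profile"]:
        warning = (
            f"Profiles differ ({run_a['profile']} vs {run_b['profile']}); "
            "scores may not be directly comparable."
        )
    return {
        "overall_delta": round(run_b["overall"] - run_a["overall"], 2),
        "task_deltas": task_deltas,
        "warning": warning,
    }


# Hypothetical runs for illustration.
old = {"profile": "full", "overall": 72.0, "tasks": {"t1": 70.0, "t2": 74.0}}
new = {"profile": "fast", "overall": 78.5, "tasks": {"t1": 78.5}}
print(compare_runs(old, new)["task_deltas"])  # prints: {'t1': 8.5}
```

Only tasks present in both runs get a delta, which keeps a fast-profile run comparable against a full one on the tasks they share.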
Key Differences from Claude Code Version
- No hooks: metrics are self-tracked (timing, tool call counting)
- No subagents — you execute tasks directly in sequence
- Same tasks, same scoring, same output format — results are cross-platform comparable
- Same integrity signature — submissions work on the same leaderboard
Important Notes
- Be honest in self-evaluation (L2/L3). Inflated scores are obvious on the leaderboard.
- The objective layers (L0 + L1) carry 55% of the weight — they can't be faked.
- Token estimates are informational only, not scored.
- Any link syntax is accepted in skill graph tasks — consistency is what's scored.