AB Test Eval — Automated Component Benchmarking via Subagents
Evaluate any OpenClaw component (skill, script, hook, cron job) by spawning parallel subagents and comparing arms. Supports multiple eval modes, auto-grading, and regression tracking.
Step 1: Choose the Eval Mode
Pick the mode that matches the user's intent:
| Mode | Question | Arms |
|---|---|---|
| baseline | Does the skill help at all? | with-skill vs without-skill |
| regression | Did changes break anything? | skill-v2 vs skill-v1 |
| model-swap | Works on another model? | model-A vs model-B |
| prompt-variant | Which description works better? | variant-A vs variant-B |
| trigger-accuracy | Dispatches correctly? | should-trigger vs should-not |
| adversarial | Robust against bad inputs? | clean vs perturbed |
| script-test | Script produces correct output? | script-A vs script-B |
| hook-dryrun | Hook responds correctly? | with-hook vs without |
| cron-dryrun | Cron payload does the right thing? | cron-run vs baseline |
| integration | Full stack works together? | full vs missing-component |
Default to baseline if unclear.
Step 2: Prepare Directory Structure
Create the eval workspace as a sibling to the skill directory:
CODEBLOCK0
Create directories with mkdir -p. Use descriptive arm names (e.g. with_skill, without_skill, new_version, old_version).
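The directory creation can also be scripted. The following is a minimal Python sketch, not part of the original skill: `make_workspace` and its arguments are hypothetical, and the only thing taken from this document is the `iteration-N/<eval-name>/<arm>/outputs` nesting implied by the paths used in later steps.

```python
from pathlib import Path


def make_workspace(workspace, iteration, eval_names, arm_names):
    """Create <workspace>/iteration-N/<eval-name>/<arm>/outputs for every eval and arm.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    created = []
    for eval_name in eval_names:
        for arm in arm_names:
            outputs = Path(workspace) / f"iteration-{iteration}" / eval_name / arm / "outputs"
            # Same effect as `mkdir -p`: creates parents and tolerates existing dirs.
            outputs.mkdir(parents=True, exist_ok=True)
            created.append(outputs)
    return created
```

Calling `make_workspace("my-workspace", 1, ["happy-path-basic"], ["with_skill", "without_skill"])` would create one outputs directory per arm; rerunning it is harmless because existing directories are left alone.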
Step 3: Define or Generate Evals
If evals already exist
Read <skill-dir>/evals/evals.json and present the cases to the user for confirmation before running. Do not auto-run without sign-off.
If evals are missing
Generate them by reading the skill's SKILL.md and creating 4-6 realistic eval cases:
1. Happy path: a clear request the skill should nail
2. Ambiguous request: could go multiple ways
3. Edge case: unusual params or corner case
4. Negative case: similar but should NOT trigger this skill
5. Multi-step case: complex multi-tool request
6. Adversarial case (if mode=adversarial): misleading / typo / injected junk
Write to <skill-dir>/evals/evals.json:
CODEBLOCK1
Then show them to the user: "Here are the test cases I plan to run. Do these look right, or do you want to add more?"
Wait for approval before spawning subagents.
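Before showing cases to the user, it can help to check that evals.json is well formed. The sketch below assumes the file is a JSON object with a skill_name string and an evals list whose entries carry id, prompt, expected_output, and files, as in the sample in this document. The function names and the specific checks are assumptions, not part of the skill.

```python
import json

REQUIRED_CASE_FIELDS = ("id", "prompt", "expected_output", "files")


def validate_evals(data):
    """Return a list of problems found in a parsed evals.json object."""
    problems = []
    if not isinstance(data.get("skill_name"), str):
        problems.append("skill_name must be a string")
    cases = data.get("evals")
    if not isinstance(cases, list) or not cases:
        return problems + ["evals must be a non-empty list"]
    seen_ids = set()
    for index, case in enumerate(cases):
        for field in REQUIRED_CASE_FIELDS:
            if field not in case:
                problems.append(f"evals[{index}] is missing '{field}'")
        if "id" in case:
            # Duplicate ids would make per-eval results ambiguous.
            if case["id"] in seen_ids:
                problems.append(f"evals[{index}] repeats id {case['id']}")
            seen_ids.add(case["id"])
    return problems


def load_evals(path):
    """Read evals.json from disk and raise if validation finds problems."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    problems = validate_evals(data)
    if problems:
        raise ValueError("; ".join(problems))
    return data
```

Returning a list of problems rather than raising on the first one lets the agent report every issue to the user in a single message.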
Step 4: Efficiency Controls — Dry-Run Preview & Smoke Test
Before spawning expensive subagents, offer the user two efficiency controls (especially useful when eval count > 3 or arms > 2).
--dry Preview
Generate a preview report that lists exactly what will run, without spawning any subagents:
CODEBLOCK2
Present this to the user and ask: "This looks like X evals across Y arms. Should I proceed, or do you want to trim the list?"
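The preview itself needs no subagents and can be computed locally. Here is a rough Python sketch under stated assumptions: `build_preview` and its input shape are hypothetical, the field labels follow the preview sample in this document, and the estimate counts one subagent per eval per arm (grading subagents are not included, matching the sample's 4 evals x 2 arms = 8 calls).

```python
def build_preview(mode, evals, arms, model="current"):
    """Render a plain-text preview of what a run would spawn, without running it.

    `evals` is a list of (name, assertion_count) pairs; this shape is an
    assumption made for the sketch.
    """
    calls = len(evals) * len(arms)  # one subagent per eval per arm
    lines = [
        "Eval Preview Report",
        f"- Mode: {mode}",
        f"- Evals: {len(evals)}",
        f"- Arms per eval: {len(arms)} ({', '.join(arms)})",
        f"- Model: {model}",
        f"- Estimated subagent calls: {calls}",
        "Evals:",
    ]
    for number, (name, assertion_count) in enumerate(evals, start=1):
        lines.append(f"{number}. {name}: {len(arms)} arms, {assertion_count} assertions")
    return "\n".join(lines)
```

The X and Y in the confirmation question correspond to `len(evals)` and `len(arms)` here.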
--smoke Smoke Test
If the user wants a quick confidence check, run only the first eval end-to-end (all arms + grading). This verifies the pipeline works before committing to the full run.
After a successful smoke test, ask: "Smoke test passed. Should I run the remaining N evals now?"
Step 5: Write Assertions
While waiting for user approval (or while subagents run), draft assertions in eval_metadata.json for each eval.
Save to <workspace>/iteration-N/<eval-name>/eval_metadata.json:
CODEBLOCK3
Assertions use text and expected fields. These are the basis for grading.
Step 6: Spawn Subagents in Parallel
For each eval, spawn all arms in the same turn. Launch as many as the environment allows concurrently.
Baseline mode
- with_skill: load SKILL.md, execute prompt, save outputs
- without_skill: same prompt, no skill, save outputs
Regression mode
- new_skill: load updated SKILL.md
- old_skill: load a snapshot of the previous version (make a cp -r snapshot before editing)
Model-swap mode
- model-a: run with skill + model A override
- model-b: run with skill + model B override
Prompt-variant mode
- variant-a: load skill variant A's SKILL.md
- variant-b: load skill variant B's SKILL.md
Trigger-accuracy mode
Each prompt gets ONE subagent tasked as the dispatcher:
"You are the dispatcher. Given this user prompt, would you load <skill-path>/SKILL.md before responding? Answer yes/no and explain why."
Save yes/no explanations, then grade TP/FP/TN/FN.
Adversarial mode
- clean: normal prompt + skill
- perturbed: prompt with typos / injected irrelevance / misleading framing + skill
Script-test mode
- Run the bundled script with controlled inputs and assert on stdout, exit code, and generated files.
- Arms can be: current-script vs previous-script, or script-with-skill-guidance vs naive-approach.
- Assertions focus on correctness, idempotency, and edge-case handling.
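For script-test arms, the stdout and exit-code checks can be made deterministic with a thin wrapper around the standard library. This is a minimal sketch, assuming the script is invoked as an argument list; `run_script` and `check_script` are hypothetical helper names, and the 60-second timeout is an arbitrary choice for the example.

```python
import subprocess


def run_script(command, stdin_text=""):
    """Run a command with controlled input; return (exit_code, stdout, stderr)."""
    completed = subprocess.run(
        command,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=60,  # arbitrary safety limit chosen for this sketch
    )
    return completed.returncode, completed.stdout, completed.stderr


def check_script(command, expected_stdout, expected_exit_code=0, stdin_text=""):
    """Return a list of failed expectations; an empty list means the run passed."""
    exit_code, stdout, _ = run_script(command, stdin_text)
    failures = []
    if exit_code != expected_exit_code:
        failures.append(f"exit code {exit_code}, expected {expected_exit_code}")
    if stdout.strip() != expected_stdout.strip():
        failures.append(f"stdout {stdout.strip()!r}, expected {expected_stdout.strip()!r}")
    return failures
```

Running `check_script` twice on the same inputs and comparing the results is one simple way to probe idempotency; checks on generated files would need to be added separately.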
Hook-dryrun mode
- Simulate a hook event by spawning a subagent and telling it: "Pretend you are an OpenClaw agent receiving a <hook-type> event with this payload. Given this hook's SKILL.md or config, what would you do?"
- Do NOT modify actual system hook registrations. This is a read-only simulation.
Cron-dryrun mode
- Extract the cron job's payload (task command or script path from jobs.json or cron config).
- Run the payload in an isolated subagent or exec dry-run context.
- Assert on expected side effects, file outputs, or command sequence.
- Also verify the cron expression is valid and produces expected schedule times.
Integration mode
- Test the full stack: user prompt → skill dispatch → script execution → hook response.
- Arms: full-stack vs missing-script vs missing-hook vs skill-only.
Task template for standard arms:
CODEBLOCK4
Step 7: Capture Timing from Notifications
When each subagent completes, its notification includes total_tokens and duration_ms. This is the only chance to capture them.
Save to <arm>/timing.json:
CODEBLOCK5
Process each notification as it arrives rather than batching.
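Writing the file as soon as a notification arrives can be as small as the sketch below. It assumes the notification is available as a dict-like payload with total_tokens and duration_ms keys; the real notification shape in OpenClaw may differ, and `save_timing` is a hypothetical helper name.

```python
import json
from pathlib import Path


def save_timing(arm_dir, notification):
    """Persist total_tokens and duration_ms from one completion notification.

    Assumes `notification` behaves like a dict; other keys are ignored.
    """
    timing = {
        "total_tokens": notification["total_tokens"],
        "duration_ms": notification["duration_ms"],
    }
    path = Path(arm_dir) / "timing.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return path
```

Calling this once per notification, rather than collecting notifications and writing at the end, means a crash mid-run does not lose the timing data already received.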
Step 8: Auto-Grade with LLM-as-Judge
Spawn a grading subagent per eval to compare all arms against the assertions:
CODEBLOCK6
Each grading.json schema:
CODEBLOCK7
For trigger-accuracy runs, save a separate trigger_grading.json with tp, fp, tn, fn tallies at the eval level.
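Once the tp, fp, tn, fn tallies exist, precision, recall, and F1 follow from the standard definitions. The sketch below is illustrative: `trigger_metrics` is a hypothetical name, and returning 0.0 when a denominator is zero is a convention chosen for the example rather than something the skill specifies.

```python
def trigger_metrics(tallies):
    """Compute precision, recall, and F1 from a dict with tp, fp, tn, fn counts."""
    tp, fp, fn = tallies["tp"], tallies["fp"], tallies["fn"]
    # precision: of the prompts judged "would trigger", how many should have.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: of the prompts that should trigger, how many were caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With 5 should-trigger and 5 should-not prompts, a dispatcher that misses one trigger and fires once by mistake gives tp=4, fp=1, tn=4, fn=1, so precision and recall are both 4/5.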
Step 9: Aggregate and Generate Report
Write benchmark.json:
CODEBLOCK8
Append a compact line to history.jsonl for regression tracking.
Then write benchmark.md with:
- Executive summary (delta, winner, biggest weaknesses)
- Per-eval breakdown table
- Notable failures with quotes
- Recommendations for improving the skill
Present the summary to the user directly in chat.
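The history.jsonl append in this step can be sketched as follows. This document does not specify the record schema, so every field here (iteration, mode, pass_rates, delta) is a hypothetical choice for illustration, as is the `append_history` name; the delta is simply the first arm's pass rate minus the second's.

```python
import json
from pathlib import Path


def append_history(history_path, iteration, mode, arm_pass_rates):
    """Append one compact JSON line summarising an iteration.

    All record fields are assumptions; adapt them to the real benchmark.json.
    """
    rates = list(arm_pass_rates.values())
    record = {
        "iteration": iteration,
        "mode": mode,
        "pass_rates": arm_pass_rates,
        # Difference between the first and second arm, in insertion order.
        "delta": round(rates[0] - rates[1], 4) if len(rates) >= 2 else None,
    }
    # Append mode keeps earlier iterations intact for trend comparison.
    with Path(history_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, separators=(",", ":")) + "\n")
    return record
```

One JSON object per line keeps the file easy to diff and to scan when comparing iterations later.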
Step 10: Iterate Based on Feedback
1. Discuss results with the user
2. Improve the skill based on failed assertions
3. Rerun into the next iteration directory (for example, iteration-2/ after iteration-1/)
4. Compare history.jsonl entries for trend
5. Repeat until satisfied
Mode-Specific Notes
- Regression: Snapshot old skill before editing (cp -r). Use previous version as baseline.
- Model-swap: Use sessions_spawn with model override.
- Prompt-variant: Create two temp skill copies with different descriptions.
- Trigger-accuracy: Generate 10 queries (5 should-trigger, 5 near-miss should-not). Grade precision/recall/F1.
- Adversarial: Perturbations include typos, irrelevant context injection, misleading framing. Report degradation score = clean_avg - perturbed_avg.
- Script-test: Run via exec for deterministic results unless script invokes LLM. Check happy-path AND error handling.
- Hook-dryrun: Simulate event via subagent with exact payload JSON. Do NOT modify actual hook registrations.
- Cron-dryrun: Validate cron expression and list next N execution times. If payload sends messages, use dry-run constraint.
- Integration: For missing-component arms, tell subagent: "You do NOT have access to <missing-component>."
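The adversarial degradation score (clean average minus perturbed average) is a one-liner once each arm has per-eval scores. The sketch assumes scores are numbers such as per-eval pass rates; `degradation_score` is a hypothetical name.

```python
from statistics import mean


def degradation_score(clean_scores, perturbed_scores):
    """Return the clean average minus the perturbed average.

    A positive value means the perturbations hurt performance.
    """
    return mean(clean_scores) - mean(perturbed_scores)
```

A score near zero suggests the skill is robust to the perturbations tried; it says nothing about perturbations that were not included.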
Hard Constraints
- Do not auto-run evals without user sign-off: present evals and wait for approval before spawning
- Respect --dry and --smoke: offer preview / smoke-test paths to improve UX and reduce wasted tokens
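AB Test Eval: Automated Component Benchmarking via Subagents
Evaluate any OpenClaw component (skill, script, hook, cron job) by spawning parallel subagents and comparing arms. Supports multiple eval modes, auto-grading, and regression tracking.
Step 1: Choose the Eval Mode
Pick the mode that matches the user's intent: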
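| Mode | Question | Arms |
|---|---|---|
| baseline | Does the skill help? | with-skill vs without-skill |
| regression | Did changes break anything? | skill-v2 vs skill-v1 |
| model-swap | Does it work on another model? | model-A vs model-B |
| prompt-variant | Which description works better? | variant-A vs variant-B |
| trigger-accuracy | Is dispatch correct? | should-trigger vs should-not |
| adversarial | Is it robust against bad inputs? | clean vs perturbed |
| script-test | Is the script output correct? | script-A vs script-B |
| hook-dryrun | Does the hook respond correctly? | with-hook vs without |
| cron-dryrun | Does the cron payload run correctly? | cron-run vs baseline |
| integration | Does the full stack work together? | full vs missing-component |
Default to baseline if unclear.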
Step 2: Prepare Directory Structure
Create the eval workspace as a sibling to the skill directory:
```
<skill-dir>/evals/evals.json
<workspace>/
  iteration-1/
    <eval-name>/
      <arm>/
        outputs/commands.md
        timing.json
        grading.json
      <arm>/
        outputs/commands.md
        timing.json
        grading.json
      eval_metadata.json
    benchmark.json
    benchmark.md
  iteration-2/
    ...
  history.jsonl
```
Create directories with mkdir -p. Use descriptive arm names (e.g. with_skill, without_skill, new_version, old_version).
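Step 3: Define or Generate Evals
If evals already exist
Read <skill-dir>/evals/evals.json and present the cases to the user for confirmation before running. Do not auto-run without user approval.
If evals are missing
Generate them by reading the skill's SKILL.md and creating 4-6 realistic eval cases:
1. Happy path: a clear request the skill should handle perfectly
2. Ambiguous request: could be handled in several ways
3. Edge case: unusual parameters or corner cases
4. Negative case: a similar request that should NOT trigger this skill
5. Multi-step case: a complex multi-tool request
6. Adversarial case (if mode=adversarial): misleading input, typos, or injected junk
Write to <skill-dir>/evals/evals.json:
```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "A realistic user request",
      "expected_output": "What correct behavior should look like",
      "files": []
    }
  ]
}
```
Then show them to the user: "Here are the test cases I plan to run. Do these look right, or do you want to add more?"
Wait for user approval before spawning subagents.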
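Step 4: Efficiency Controls: Dry-Run Preview & Smoke Test
Before spawning expensive subagents, offer the user two efficiency controls (especially useful when eval count > 3 or arms > 2).
--dry Preview
Generate a preview report that lists what will run, without spawning any subagents:
```markdown
Eval Preview Report
- Mode: baseline
- Evals: 4
- Arms per eval: 2 (with_skill, without_skill)
- Model: current
- Estimated subagent calls: 8
Evals:
1. happy-path-basic: 2 arms, 3 assertions
2. ambiguous-request: 2 arms, 3 assertions
```
Present this to the user and ask: "This looks like X evals across Y arms. Should I proceed, or do you want to trim the list?"
--smoke Smoke Test
If the user wants a quick confidence check, run only the first eval end-to-end (all arms + grading). This verifies the pipeline works before committing to the full run.
After a successful smoke test, ask: "Smoke test passed. Should I run the remaining N evals now?"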
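Step 5: Write Assertions
While waiting for user approval (or while subagents run), draft assertions in eval_metadata.json for each eval.
Save to <workspace>/iteration-N/<eval-name>/eval_metadata.json:
```json
{
  "eval_id": 1,
  "eval_name": "happy-path-basic",
  "prompt": "The user's task prompt",
  "assertions": [
    {
      "text": "Uses the --force flag",
      "expected": true
    },
    {
      "text": "Warns about the OAuth timeout pitfall",
      "expected": true
    }
  ]
}
```
Assertions use text and expected fields. These are the basis for grading.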
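Step 6: Spawn Subagents in Parallel
For each eval, spawn all arms in the same turn. Launch as many subagents concurrently as the environment allows.
Baseline mode
- with_skill: load SKILL.md, execute prompt, save outputs
- without_skill: same prompt, no skill, save outputs
Regression mode
- new_skill: load the updated SKILL.md
- old_skill: load a snapshot of the previous version (make a cp -r snapshot before editing)
Model-swap mode
- model-a: run with skill + model A override
- model-b: run with skill + model B override
Prompt-variant mode
- variant-a: load skill variant A's SKILL.md
- variant-b: load skill variant B's SKILL.md
Trigger-accuracy mode
Each prompt gets one subagent tasked as the dispatcher:
"You are the dispatcher. Given this user prompt, would you load <skill-path>/SKILL.md before responding? Answer yes/no and explain why."
Save the yes/no explanations, then grade TP/FP/TN/FN.
Adversarial mode
- clean: normal prompt + skill
- perturbed: prompt with typos / injected irrelevance / misleading framing + skill
Script-test mode
- Run the bundled script with controlled inputs and assert on stdout, exit code, and generated files.
- Arms can be: current-script vs previous-script, or script-with-skill-guidance vs naive-approach.
- Assertions focus on correctness, idempotency, and edge-case handling.
Hook-dryrun mode
- Simulate a hook event by spawning a subagent and telling it: "Pretend you are an OpenClaw agent receiving a <hook-type> event with this payload. Given this hook's SKILL.md or config, what would you do?"
- Do NOT modify actual system hook registrations. This is a read-only simulation.
Cron-dryrun mode
- Extract the cron job's payload (task command or script path from jobs.json or cron config).
- Run the payload in an isolated subagent or exec dry-run context.
- Assert on expected side effects, file outputs, or command sequence.
- Also verify the cron expression is valid and produces the expected schedule times.
Integration mode
- Test the full stack: user prompt → skill dispatch → script execution → hook response.
- Arms: full-stack vs missing-script vs missing-hook vs skill-only.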
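Task template for standard arms:
```
Execute this task:
- Arm: <arm>
- Skill path: <skill-path> or none
- Model override: <model> or default
- Task: <prompt>
- Input files: <files> or none
- Save outputs to: <workspace>/iteration-N/<eval-name>/<arm>/outputs/commands.md
- Execute the task with the available tools: if the subagent has tool access, actually run the commands; if not, record what would be run.
```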
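Step 7: Capture Timing from Notifications
When each subagent completes, its notification includes total_tokens and duration_ms. This is the only chance to capture them.
Save to <arm>/timing.json: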
json
{
total_tokens: 84852,
duration_ms: 23332,
totalduration