AB Test Eval — Automated Component Benchmarking via Subagents

Evaluate any OpenClaw component (skill, script, hook, cron job) by spawning parallel subagents and comparing arms. Supports multiple eval modes, auto-grading, and regression tracking.

Step 1: Choose the Eval Mode

Pick the mode that matches the user's intent:

Mode	Question	Arms
baseline	Does the skill help at all?	with-skill vs without-skill
regression

Default to baseline if unclear.

Step 2: Prepare Directory Structure

Create the eval workspace as a sibling to the skill directory:

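```
<skill-dir>/evals/evals.json
<skill-name>-workspace/
  iteration-1/
    <eval-name>/
      <arm-1>/
        outputs/commands.md
        timing.json
        grading.json
      <arm-2>/
        outputs/commands.md
        timing.json
        grading.json
      eval_metadata.json
    benchmark.json
    benchmark.md
  iteration-2/
    ...
  history.jsonl
```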

Create directories with mkdir -p. Use descriptive arm names (e.g. with_skill, without_skill, new_version, old_version).
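As a rough illustration of the directory setup, the sketch below creates one outputs directory per eval and arm. It assumes a layout of `<workspace>/iteration-N/<eval-name>/<arm>/outputs/`; the function name `create_workspace` and its parameters are hypothetical, not part of OpenClaw.

```python
from pathlib import Path


def create_workspace(workspace: Path, iteration: int,
                     eval_names: list[str], arms: list[str]) -> list[Path]:
    """Create <workspace>/iteration-N/<eval-name>/<arm>/outputs for every eval and arm."""
    created = []
    for eval_name in eval_names:
        for arm in arms:
            outputs = workspace / f"iteration-{iteration}" / eval_name / arm / "outputs"
            # parents=True, exist_ok=True mirrors the behavior of `mkdir -p`.
            outputs.mkdir(parents=True, exist_ok=True)
            created.append(outputs)
    return created
```

Calling it twice with the same arguments is harmless, which matters when an iteration is resumed after an interruption.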

Step 3: Define or Generate Evals

If evals already exist

Read <skill-dir>/evals/evals.json and present the cases to the user for confirmation before running. Do not auto-run without sign-off.

If evals are missing

Generate them by reading the skill's SKILL.md and creating 4-6 realistic eval cases:

1. Happy path: clear request the skill should nail
2. Ambiguous request: could go multiple ways
3. Edge case: unusual params or corner case
4. Negative case: similar but should NOT trigger this skill
5. Multi-step case: complex multi-tool request
6. Adversarial case (if mode=adversarial): misleading / typo / injected junk

Write to <skill-dir>/evals/evals.json:

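```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Realistic user request",
      "expected_output": "What correct behavior looks like",
      "files": []
    }
  ]
}
```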

Then show them to the user: "Here are the test cases I plan to run. Do these look right, or do you want to add more?"

Wait for approval before spawning subagents.
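Before presenting cases for sign-off, it can help to check that the file is well-formed. The following is a minimal sketch, assuming `evals.json` holds an `evals` array whose entries carry `id`, `prompt`, and `expected_output` fields; the helper `load_evals` and the set of required keys are illustrative assumptions.

```python
import json
from pathlib import Path

# Assumed field names for each eval case; adjust to the real schema.
REQUIRED_KEYS = {"id", "prompt", "expected_output"}


def load_evals(path: Path) -> list[dict]:
    """Load evals.json and fail early if any case is missing a required field."""
    data = json.loads(path.read_text(encoding="utf-8"))
    evals = data.get("evals", [])
    for case in evals:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"eval {case.get('id', '?')} is missing: {sorted(missing)}")
    return evals
```

Failing before any subagent is spawned keeps a malformed case from wasting a full run.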

Step 4: Efficiency Controls — Dry-Run Preview & Smoke Test

Before spawning expensive subagents, offer the user two efficiency controls (especially useful when eval count > 3 or arms > 2).

`--dry` Preview

Generate a preview report that lists exactly what will run, without spawning any subagents:

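```markdown
# Eval Preview Report

- Mode: baseline
- Evals: 4
- Arms per eval: 2 (with_skill, without_skill)
- Model: current
- Estimated subagent calls: 8

Evals:

1. happy-path-basic: 2 arms, 3 assertions
2. ambiguous-request: 2 arms, 3 assertions
```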

Present this to the user and ask: "This looks like X evals across Y arms. Should I proceed, or do you want to trim the list?"
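The preview is simple arithmetic over the eval list, as the sketch below suggests. It assumes each eval is a dict with an `eval_name` and an optional `assertions` list; `dry_run_preview` is a hypothetical helper, and the estimate counts only arm subagents, not any grading subagents.

```python
def dry_run_preview(mode: str, evals: list[dict], arms: list[str]) -> str:
    """Build a plain-text preview of what a run would spawn, without spawning anything."""
    lines = [
        "Eval Preview Report",
        f"- Mode: {mode}",
        f"- Evals: {len(evals)}",
        f"- Arms per eval: {len(arms)} ({', '.join(arms)})",
        # One subagent per (eval, arm) pair; grading subagents are not counted here.
        f"- Estimated subagent calls: {len(evals) * len(arms)}",
        "",
        "Evals:",
    ]
    for number, case in enumerate(evals, start=1):
        assertion_count = len(case.get("assertions", []))
        lines.append(
            f"{number}. {case['eval_name']}: {len(arms)} arms, {assertion_count} assertions"
        )
    return "\n".join(lines)
```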

`--smoke` Smoke Test

If the user wants a quick confidence check, run only the first eval end-to-end (all arms + grading). This verifies the pipeline works before committing to the full run.

After a successful smoke test, ask: "Smoke test passed. Should I run the remaining N evals now?"

Step 5: Write Assertions

While waiting for user approval (or while subagents run), draft assertions in eval_metadata.json for each eval.

Save to <workspace>/iteration-N/<eval-name>/eval_metadata.json:

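```json
{
  "eval_id": 1,
  "eval_name": "happy-path-basic",
  "prompt": "The user's task prompt",
  "assertions": [
    {
      "text": "Uses the --force flag",
      "expected": true
    },
    {
      "text": "Warns about the OAuth timeout pitfall",
      "expected": true
    }
  ]
}
```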

Assertions use text and expected fields. These are the basis for grading.

Step 6: Spawn Subagents in Parallel

For each eval, spawn all arms in the same turn. Launch as many as the environment allows concurrently.
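One way to picture "all arms in the same turn" with a concurrency cap is the asyncio sketch below. `spawn_subagent` is a hypothetical stand-in that only simulates work; the real spawn call and its arguments depend on the environment and are not shown in this document.

```python
import asyncio


async def spawn_subagent(eval_name: str, arm: str) -> dict:
    """Hypothetical stand-in for the real subagent spawn call."""
    await asyncio.sleep(0.01)  # simulate the subagent doing work
    return {"eval": eval_name, "arm": arm, "status": "done"}


async def run_eval(eval_name: str, arms: list[str], max_concurrent: int = 4) -> list[dict]:
    """Launch every arm of one eval concurrently, capped at max_concurrent."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_arm(arm: str) -> dict:
        async with semaphore:
            return await spawn_subagent(eval_name, arm)

    # gather returns results in the same order as the arms list.
    return await asyncio.gather(*(run_arm(arm) for arm in arms))
```

For example, `asyncio.run(run_eval("happy-path-basic", ["with_skill", "without_skill"]))` returns one result dict per arm.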

Baseline mode

- with_skill: load SKILL.md, execute prompt, save outputs
- without_skill: same prompt, no skill, save outputs

Regression mode

- new_skill: load updated SKILL.md
- old_skill: load a snapshot of the previous version (make a cp -r snapshot before editing)

Model-swap mode

- model-a: run with skill + model A override
- model-b: run with skill + model B override

Prompt-variant mode

- variant-a: load skill variant A's SKILL.md
- variant-b: load skill variant B's SKILL.md

Trigger-accuracy mode

Each prompt gets ONE subagent tasked as the dispatcher:

"You are the dispatcher. Given this user prompt, would you load <skill-path>/SKILL.md before responding? Answer yes/no and explain why."

Save yes/no explanations, then grade TP/FP/TN/FN.
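The TP/FP/TN/FN tally and the derived precision, recall, and F1 can be computed as in this sketch. It assumes each dispatcher answer has been reduced to a pair of booleans (should the skill trigger, did the dispatcher say yes); the function names are illustrative.

```python
from collections import Counter


def classify(should_trigger: bool, said_yes: bool) -> str:
    """Map one dispatcher answer to a confusion-matrix cell."""
    if should_trigger:
        return "tp" if said_yes else "fn"
    return "fp" if said_yes else "tn"


def trigger_metrics(results: list[tuple[bool, bool]]) -> dict:
    """Tally TP/FP/TN/FN and derive precision, recall, and F1."""
    tally = Counter(classify(should, said) for should, said in results)
    tp, fp, fn = tally["tp"], tally["fp"], tally["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "tn": tally["tn"], "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Zero denominators fall back to 0.0 so that a run with no positive answers still produces a report.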

Adversarial mode

- clean: normal prompt + skill
- perturbed: prompt with typos / injected irrelevance / misleading framing + skill

Script-test mode

- Run the bundled script with controlled inputs and assert on stdout, exit code, and generated files.
- Arms can be: current-script vs previous-script, or script-with-skill-guidance vs naive-approach.
- Assertions focus on correctness, idempotency, and edge-case handling.
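A script-test arm might look roughly like the sketch below, which uses Python's standard `subprocess` module. `DEMO_SCRIPT` is a made-up stand-in (it uppercases stdin) so the example runs anywhere; in practice it would be replaced by the path to the bundled script.

```python
import subprocess
import sys

# Stand-in "script" that uppercases stdin; swap in the real bundled script.
DEMO_SCRIPT = [sys.executable, "-c",
               "import sys; print(sys.stdin.read().strip().upper())"]


def run_script(args: list[str], stdin_text: str = "",
               timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run a script with controlled input and capture stdout, stderr, and exit code."""
    return subprocess.run(args, input=stdin_text, capture_output=True,
                          text=True, timeout=timeout)


def is_idempotent(args: list[str], stdin_text: str = "") -> bool:
    """Run the script twice and check that exit code and stdout are unchanged."""
    first = run_script(args, stdin_text)
    second = run_script(args, stdin_text)
    return (first.returncode, first.stdout) == (second.returncode, second.stdout)


result = run_script(DEMO_SCRIPT, stdin_text="hello\n")
assert result.returncode == 0
assert result.stdout.strip() == "HELLO"
```

Comparing only stdout and exit code is a weak idempotency check; a script that writes files would also need its outputs compared between runs.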

Hook-dryrun mode

- Simulate a hook event by spawning a subagent and telling it: "Pretend you are an OpenClaw agent receiving a <hook-type> event with this payload. Given this hook's SKILL.md or config, what would you do?"
- Do NOT modify actual system hook registrations. This is a read-only simulation.

Cron-dryrun mode

- Extract the cron job's payload (task command or script path from jobs.json or cron config).
- Run the payload in an isolated subagent or exec dry-run context.
- Assert on expected side effects, file outputs, or command sequence.
- Also verify the cron expression is valid and produces expected schedule times.
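The last check, validating the expression and listing upcoming run times, can be sketched with the standard library alone. This is a simplified illustration, not a full cron implementation: it assumes five numeric fields, supports only `*`, single values, ranges, lists, and `/n` steps, and does not handle names such as `MON` or `7` for Sunday. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# Allowed ranges for: minute, hour, day-of-month, month, day-of-week (0 = Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def parse_field(field: str, low: int, high: int) -> set[int]:
    """Expand one cron field ('*', 'a', 'a-b', 'a,b', '*/n', 'a-b/n') into a set of ints."""
    values: set[int] = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"bad step in {part!r}")
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = int(base)
            end = high if step_text else start
        if not (low <= start <= end <= high):
            raise ValueError(f"{part!r} is outside {low}-{high}")
        values.update(range(start, end + 1, step))
    return values


def next_runs(expression: str, start: datetime, count: int) -> list[datetime]:
    """Validate a 5-field cron expression and list the next `count` run times after `start`."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day-of-month month day-of-week")
    minutes, hours, doms, months, dows = (
        parse_field(field, low, high) for field, (low, high) in zip(fields, FIELD_RANGES)
    )
    dom_any, dow_any = fields[2] == "*", fields[4] == "*"
    runs: list[datetime] = []
    moment = start.replace(second=0, microsecond=0) + timedelta(minutes=1)
    limit = moment + timedelta(days=366 * 5)  # stop on schedules that never fire
    while len(runs) < count and moment < limit:
        dom_ok = moment.day in doms
        dow_ok = moment.isoweekday() % 7 in dows  # isoweekday: Mon=1..Sun=7, so Sun -> 0
        # Classic cron rule: when both day fields are restricted, either may match.
        day_ok = (dom_ok or dow_ok) if not (dom_any or dow_any) else (dom_ok and dow_ok)
        if moment.minute in minutes and moment.hour in hours and moment.month in months and day_ok:
            runs.append(moment)
        moment += timedelta(minutes=1)
    return runs
```

The minute-by-minute scan is slow for sparse schedules but easy to verify by eye, which suits a dry run better than a clever skip-ahead.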

Integration mode

- Test the full stack: user prompt → skill dispatch → script execution → hook response.
- Arms: full-stack vs missing-script vs missing-hook vs skill-only.

Task template for standard arms:

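```
Execute this task:

- Arm: <arm-name>
- Skill path: <skill-path> or none
- Model override: <model> or default
- Task: <prompt>
- Input files: <files> or none
- Save outputs to: <workspace>/iteration-N/<eval-name>/<arm>/outputs/commands.md
- Execute the task with available tools: if the subagent has tool access, actually run the commands; if not, document what would be executed.
```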

Step 7: Capture Timing from Notifications

When each subagent completes, its notification includes total_tokens and duration_ms. This is the only chance to capture them.

Save to <arm>/timing.json:

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332
}
```

Process each notification as it arrives rather than batching.
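Handling each notification on arrival could be as small as the sketch below. It assumes the notification is available as a dict containing `total_tokens` and `duration_ms`; the real notification shape is not specified here, and `save_timing` is a hypothetical helper.

```python
import json
from pathlib import Path


def save_timing(arm_dir: Path, notification: dict) -> Path:
    """Persist the timing fields from one completion notification to <arm>/timing.json."""
    timing = {
        "total_tokens": notification["total_tokens"],
        "duration_ms": notification["duration_ms"],
    }
    arm_dir.mkdir(parents=True, exist_ok=True)
    path = arm_dir / "timing.json"
    path.write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return path
```

A missing field raises `KeyError` immediately, which is preferable to discovering an empty timing file during aggregation.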

Step 8: Auto-Grade with LLM-as-Judge

Spawn a grading subagent per eval to compare all arms against the assertions:

CODEBLOCK6

Each grading.json follows this schema:

CODEBLOCK7

For trigger-accuracy runs, save a separate trigger_grading.json with tp, fp, tn, fn tallies at the eval level.

Step 9: Aggregate and Generate Report

Write benchmark.json:

CODEBLOCK8

Append a compact line to history.jsonl for regression tracking.
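Appending to history.jsonl might look like the following sketch. The entry fields (`timestamp`, `iteration`, `pass_rates`) are assumptions chosen for illustration; the actual benchmark schema is not shown in this document.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_history(history_path: Path, iteration: int,
                   pass_rates: dict[str, float]) -> dict:
    """Append one compact JSON line per iteration to history.jsonl."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "iteration": iteration,
        "pass_rates": pass_rates,  # assumed shape: arm name -> fraction of assertions passed
    }
    # Mode "a" appends, so earlier iterations are never overwritten.
    with history_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, separators=(",", ":")) + "\n")
    return entry
```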

Then write benchmark.md with:

- Executive summary (delta, winner, biggest weaknesses)
- Per-eval breakdown table
- Notable failures with quotes
- Recommendations for improving the skill

Present the summary to the user directly in chat.

Step 10: Iterate Based on Feedback

1. Discuss results with the user
2. Improve the skill based on failed assertions
3. Rerun into the next iteration directory (e.g. iteration-2/)
4. Compare history.jsonl entries for trend
5. Repeat until satisfied
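The trend comparison in the steps above can be sketched as a pass over history.jsonl. This assumes each line is a JSON object with a `pass_rates` mapping from arm name to pass rate; that field name and the helper `pass_rate_deltas` are hypothetical.

```python
import json
from pathlib import Path


def pass_rate_deltas(history_path: Path, arm: str) -> list[float]:
    """Return the change in one arm's pass rate between consecutive iterations."""
    rates = []
    for line in history_path.read_text(encoding="utf-8").splitlines():
        if line.strip():  # tolerate blank lines in the log
            rates.append(json.loads(line)["pass_rates"][arm])
    return [round(later - earlier, 4) for earlier, later in zip(rates, rates[1:])]
```

A negative delta flags a regression between two iterations and is a reasonable trigger for revisiting the most recent skill edit.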

Mode-Specific Notes

- Regression: Snapshot the old skill before editing (cp -r). Use the previous version as the baseline.
- Model-swap: Use sessions_spawn with a model override.
- Prompt-variant: Create two temp skill copies with different descriptions.
- Trigger-accuracy: Generate 10 queries (5 should-trigger, 5 near-miss should-not). Grade precision/recall/F1.
- Adversarial: Perturbations include typos, irrelevant context injection, misleading framing. Report degradation score = clean_avg - perturbed_avg.
- Script-test: Run via exec for deterministic results unless the script invokes an LLM. Check happy-path AND error handling.
- Hook-dryrun: Simulate the event via a subagent with the exact payload JSON. Do NOT modify actual hook registrations.
- Cron-dryrun: Validate the cron expression and list the next N execution times. If the payload sends messages, use a dry-run constraint.
- Integration: For missing-component arms, tell the subagent: "You do NOT have access to <component>."
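For the adversarial note, the degradation score is a difference of two averages. The sketch below assumes per-eval scores are floats (for example, the fraction of assertions passed); that scoring unit is an assumption, as is the function name.

```python
from statistics import mean


def degradation_score(clean_scores: list[float], perturbed_scores: list[float]) -> float:
    """Average clean score minus average perturbed score; positive means perturbation hurt."""
    return mean(clean_scores) - mean(perturbed_scores)
```

A score near zero suggests the skill is robust to the perturbations tried; it says nothing about perturbations that were not tested.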

Hard Constraints

- Do not auto-run evals without user sign-off: present evals and wait for approval before spawning
- Respect --dry and --smoke: offer preview / smoke-test paths to improve UX and reduce wasted tokens
