SkillProbe

A/B evaluate whether a skill actually helps, or just adds complexity.

Runs inside the current agent runtime (Cursor, OpenClaw, ClaudeCode). No extra API key required.

7-Step Workflow

Copy this checklist and track progress:

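Evaluation Progress:

- [ ] Step 1: Profile the skill (read SKILL.md, extract domain/triggers/boundaries)
- [ ] Step 2: Design the evaluation plan (task categories, count, difficulty mix)
- [ ] Step 3: Generate test tasks (normal + edge + adversarial)
- [ ] Step 4: Dispatch baseline tasks to Sub-Agent A (NO skill content!)
- [ ] Step 5: Dispatch with-skill tasks to Sub-Agent B (full skill included)
- [ ] Step 6: Score both runs (rules + results + optional LLM judge)
- [ ] Step 7: Attribute differences and generate the report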

Steps 1-3 and 6-7: You (orchestrator) do these.
Steps 4-5: Dispatch to isolated sub-agents. NEVER execute tasks yourself.

Steps 1-3: Prepare (Orchestrator)

1. Profile: Read the target skill's SKILL.md. Extract problem domain, trigger conditions, capabilities, boundaries.
2. Design plan: Choose task categories (QA, retrieval, coding, analysis, etc.), count, difficulty distribution (easy 30% / medium 40% / hard 20% / edge 10%).
3. Generate tasks: Create diverse, self-contained test prompts. Do NOT mention the skill name or A/B experiment in task prompts.
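The difficulty split in the plan step rarely divides evenly for small task counts. A minimal sketch of one way to turn the 30/40/20/10 distribution into whole-number task counts, using largest-remainder rounding; the function name `allocate_tasks` and the dictionary layout are illustrative assumptions, not part of SkillProbe:

```python
# Difficulty shares from the plan step, as integer percentages.
DIFFICULTY_PERCENT = {"easy": 30, "medium": 40, "hard": 20, "edge": 10}


def allocate_tasks(total, percents=DIFFICULTY_PERCENT):
    """Split `total` tasks across difficulty levels (hypothetical helper).

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")

    counts = {}
    remainders = {}
    for level, pct in percents.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        counts[level], remainders[level] = divmod(total * pct, 100)

    # Give the leftover tasks to the levels with the largest remainders.
    leftover = total - sum(counts.values())
    ranked = sorted(percents, key=lambda level: remainders[level], reverse=True)
    for level in ranked[:leftover]:
        counts[level] += 1
    return counts


print(allocate_tasks(30))  # {'easy': 9, 'medium': 12, 'hard': 6, 'edge': 3}
```

With 7 tasks the same call yields 2 easy, 3 medium, 1 hard, and 1 edge task, so even a small run still covers every difficulty level.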

Steps 4-5: Dispatch (Three-Role Isolation)

Create two separate sub-agent sessions. See DISPATCH_PROTOCOL.md for exact prompt templates and constraints.

Key rules:

- Sub-Agent A (baseline): receives ONLY task prompts, zero skill content
- Sub-Agent B (with-skill): receives task prompts + full skill content
- Different session_id for each sub-agent
- Orchestrator never answers any test task
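The key rules can be sketched as a small payload builder. This is only an illustration under stated assumptions: `DispatchPayload`, `build_payloads`, and the use of `uuid4` for session IDs are hypothetical, and the actual prompt templates are defined in DISPATCH_PROTOCOL.md. The sketch also enforces the earlier rule that task prompts never mention the skill name.

```python
import uuid
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class DispatchPayload:
    """What one sub-agent session receives (hypothetical structure)."""
    role: str                       # "baseline" or "with_skill"
    session_id: str                 # must differ between the two sub-agents
    task_prompts: Tuple[str, ...]
    skill_content: Optional[str]    # None for the baseline sub-agent


def build_payloads(task_prompts: Sequence[str], skill_name: str, skill_content: str):
    """Return (baseline, with_skill) payloads for the same task prompts."""
    # Reject prompts that would reveal the skill under test.
    leaked = [p for p in task_prompts if skill_name.lower() in p.lower()]
    if leaked:
        raise ValueError(f"{len(leaked)} task prompt(s) mention the skill name")

    prompts = tuple(task_prompts)
    # Sub-Agent A: task prompts only, zero skill content.
    baseline = DispatchPayload("baseline", uuid.uuid4().hex, prompts, None)
    # Sub-Agent B: identical prompts plus the full skill content.
    with_skill = DispatchPayload("with_skill", uuid.uuid4().hex, prompts, skill_content)
    return baseline, with_skill
```

Both payloads carry identical task prompts, so any difference in the outputs can be attributed to the skill content rather than to the task set.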

Steps 6-7: Score and Report (Orchestrator)

Collect outputs from both sub-agents. Score across 6 dimensions (100-point scale). See SCORING_REFERENCE.md for scoring layers, dimension weights, thresholds, and output format.
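As a rough illustration of how six weighted dimensions can roll up into a 100-point score, the sketch below uses placeholder dimension names and placeholder weights. The real dimensions, weights, and thresholds are defined in SCORING_REFERENCE.md; nothing here should be read as those values.

```python
# PLACEHOLDER names and weights, for illustration only.
# The real dimensions and weights are defined in SCORING_REFERENCE.md.
WEIGHTS = {
    "dim_1": 20, "dim_2": 20, "dim_3": 15,
    "dim_4": 15, "dim_5": 15, "dim_6": 15,
}


def total_score(ratings, weights=WEIGHTS):
    """Combine per-dimension ratings in [0, 1] into a 0-100 score."""
    if set(ratings) != set(weights):
        raise ValueError("ratings must cover exactly the weighted dimensions")
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    for name, value in ratings.items():
        if not 0 <= value <= 1:
            raise ValueError(f"rating for {name} must be between 0 and 1")
    return sum(weights[name] * ratings[name] for name in weights)


def score_delta(baseline_ratings, with_skill_ratings):
    """Positive when the with-skill run outscores the baseline."""
    return total_score(with_skill_ratings) - total_score(baseline_ratings)


baseline = {name: 0.5 for name in WEIGHTS}
with_skill = {name: 0.75 for name in WEIGHTS}
print(total_score(baseline), total_score(with_skill), score_delta(baseline, with_skill))
# 50.0 75.0 25.0
```

The delta alone says how much the skill changed the score; the report step still has to explain why.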

Principles

1. Three-role isolation: The orchestrator designs and scores. Sub-agents execute. Never mix the roles.
2. Real execution only: No hypothetical or simulated outputs.
3. Evidence-backed scoring: Rules and results first; LLM judge optional.
4. Attribution over numbers: Explain WHY, not just how much.
5. Finish before claiming uncertainty: Report a result as inconclusive only after a real execution attempt.

Standalone CLI (Optional)

For local runs outside an agent:

    skillprobe evaluate <skill-path> --tasks 30 --repeats 2 --db outputs/evaluations.db

Add --llm-judge [--judge-model <model>] for pairwise judge scoring. The CLI uses whatever LLM provider the local runtime is configured with.
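For scripted runs, the command line can be assembled programmatically. The sketch below only builds the argument list and does not run anything. It uses the `evaluate` subcommand and the `--tasks`, `--repeats`, `--db`, `--llm-judge`, and `--judge-model` options that this document names. The helper `build_command` is hypothetical, the path `skills/my-skill` is a made-up example, and the default values mirror this document's example command rather than any documented CLI defaults.

```python
import shlex
from typing import List, Optional


def build_command(
    skill_path: str,
    tasks: int = 30,            # mirrors the example command, not a known default
    repeats: int = 2,           # mirrors the example command, not a known default
    db: str = "outputs/evaluations.db",
    llm_judge: bool = False,
    judge_model: Optional[str] = None,
) -> List[str]:
    """Build the argv list for a standalone SkillProbe run (hypothetical helper)."""
    if judge_model is not None and not llm_judge:
        raise ValueError("--judge-model only applies together with --llm-judge")

    argv = [
        "skillprobe", "evaluate", skill_path,
        "--tasks", str(tasks),
        "--repeats", str(repeats),
        "--db", db,
    ]
    if llm_judge:
        argv.append("--llm-judge")
        if judge_model is not None:
            argv += ["--judge-model", judge_model]
    return argv


# "skills/my-skill" is a made-up path for illustration.
print(shlex.join(build_command("skills/my-skill", llm_judge=True)))
```

The resulting list can be passed to `subprocess.run` once the CLI is installed; keeping it as a list avoids shell-quoting problems with unusual skill paths.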

Reference Files

- DISPATCH_PROTOCOL.md: Three-role architecture, sub-agent prompt templates, dispatch constraints, evidence requirements
- SCORING_REFERENCE.md: Scoring layers, 6-dimension weights, derived metrics, recommendation thresholds, report format

Security & Privacy

Skill content and task prompts are sent only to the configured LLM provider. All evaluation data is stored locally. No telemetry is collected.

