SkillProbe
A/B evaluate whether a skill actually helps, or just adds complexity.
Runs inside the current agent runtime (Cursor, OpenClaw, ClaudeCode). No extra API key required.
7-Step Workflow
Copy this checklist and track progress:
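Evaluation Progress:
- [ ] Step 1: Profile the skill (read SKILL.md, extract domain/triggers/boundaries)
- [ ] Step 2: Design the evaluation plan (task categories, count, difficulty mix)
- [ ] Step 3: Generate test tasks (normal + edge + adversarial)
- [ ] Step 4: Dispatch baseline tasks to Sub-Agent A (no skill content!)
- [ ] Step 5: Dispatch with-skill tasks to Sub-Agent B (full skill included)
- [ ] Step 6: Score both runs (rules + results + optional LLM judge)
- [ ] Step 7: Attribute differences and generate the report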
Steps 1-3 and 6-7: You (orchestrator) do these.
Steps 4-5: Dispatch to isolated sub-agents. NEVER execute tasks yourself.
Steps 1-3: Prepare (Orchestrator)
1. Profile: Read the target skill's SKILL.md. Extract the problem domain, trigger conditions, capabilities, and boundaries.
2. Design plan: Choose task categories (QA, retrieval, coding, analysis, etc.), task count, and difficulty distribution (easy 30% / medium 40% / hard 20% / edge 10%).
3. Generate tasks: Create diverse, self-contained test prompts. Do NOT mention the skill name or the A/B experiment in task prompts.
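As a rough illustration of the difficulty distribution in the design plan, the sketch below splits a task count across the four levels. The function name `allocate_tasks` and the largest-remainder rounding rule are assumptions made for this example; they are not taken from SkillProbe itself.

```python
# Difficulty mix from the design-plan step; percentages sum to 100.
DIFFICULTY_MIX = {"easy": 30, "medium": 40, "hard": 20, "edge": 10}


def allocate_tasks(total: int, mix: dict[str, int] = DIFFICULTY_MIX) -> dict[str, int]:
    """Split `total` tasks across difficulty levels (hypothetical helper).

    Uses largest-remainder rounding so the counts always add up to `total`.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    # divmod keeps everything in integers: (whole tasks, remainder out of 100).
    shares = {level: divmod(total * pct, 100) for level, pct in mix.items()}
    counts = {level: whole for level, (whole, _) in shares.items()}
    leftover = total - sum(counts.values())
    # Give the remaining tasks to the levels with the largest remainders.
    by_remainder = sorted(mix, key=lambda level: shares[level][1], reverse=True)
    for level in by_remainder[:leftover]:
        counts[level] += 1
    return counts


if __name__ == "__main__":
    print(allocate_tasks(30))
```

With small task counts the rounding matters: some levels get one more task than their floor share, so that the total is preserved.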
Steps 4-5: Dispatch (Three-Role Isolation)
Create two separate sub-agent sessions. See DISPATCH_PROTOCOL.md for exact prompt templates and constraints.
Key rules:
- Sub-Agent A (baseline): receives ONLY task prompts, zero skill content
- Sub-Agent B (with-skill): receives task prompts + full skill content
- Different session_id for each sub-agent
- Orchestrator never answers any test task
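The isolation rules above can be sketched as a small data-preparation step. Everything here is illustrative: the `SubAgentSession` type, the `build_sessions` helper, and the choice to prepend the skill text to each prompt are assumptions, and the actual prompt templates live in DISPATCH_PROTOCOL.md.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class SubAgentSession:
    """Inputs for one isolated sub-agent (hypothetical structure)."""
    role: str                  # "baseline" or "with-skill"
    session_id: str            # must differ between the two sub-agents
    prompts: tuple[str, ...]   # what the sub-agent actually receives


def build_sessions(
    tasks: list[str], skill_name: str, skill_text: str
) -> tuple[SubAgentSession, SubAgentSession]:
    """Prepare inputs for Sub-Agent A (baseline) and Sub-Agent B (with-skill)."""
    # Task prompts must not reveal the skill under test.
    for task in tasks:
        if skill_name.lower() in task.lower():
            raise ValueError(f"task prompt mentions the skill name: {task!r}")
    # Baseline: task prompts only, zero skill content.
    baseline = SubAgentSession("baseline", uuid.uuid4().hex, tuple(tasks))
    # With-skill: same tasks plus the full skill content (prepending is an assumption).
    with_skill = SubAgentSession(
        "with-skill",
        uuid.uuid4().hex,
        tuple(f"{skill_text}\n\n{task}" for task in tasks),
    )
    return baseline, with_skill
```

The orchestrator would hand each session to a separate sub-agent and only collect the outputs; it does not answer any prompt itself.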
Steps 6-7: Score and Report (Orchestrator)
Collect outputs from both sub-agents. Score across 6 dimensions (100-point scale). See SCORING_REFERENCE.md for scoring layers, dimension weights, thresholds, and output format.
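To make the 100-point scale concrete, here is a minimal sketch of a weighted score and the baseline-versus-skill delta. The dimension names and weights in the example are placeholders, not SkillProbe's real ones; the actual six dimensions, weights, and thresholds are defined in SCORING_REFERENCE.md.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into one 0-100 score."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"score for {name!r} is outside 0-100")
    # Normalizing by the weight sum keeps the result on the 100-point scale.
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def score_delta(
    baseline: dict[str, float], with_skill: dict[str, float], weights: dict[str, float]
) -> float:
    """Positive when the with-skill run scored higher than the baseline."""
    return weighted_score(with_skill, weights) - weighted_score(baseline, weights)


if __name__ == "__main__":
    # Placeholder dimensions and weights, for illustration only.
    weights = {"dim_a": 2.0, "dim_b": 1.0}
    print(score_delta({"dim_a": 60, "dim_b": 60}, {"dim_a": 90, "dim_b": 60}, weights))
```

A delta on its own is only the "how much"; the report still needs to attribute the difference to specific skill behavior.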
Principles
1. Three-role isolation: Orchestrator designs and scores. Sub-agents execute. Never mix.
2. Real execution only: No hypothetical or simulated outputs.
3. Evidence-backed scoring: Rules and results first; LLM judge optional.
4. Attribution over numbers: Explain WHY, not just how much.
5. Finish before claiming uncertainty: Inconclusive only after real attempted execution.
Standalone CLI (Optional)
For local runs outside an agent:
skillprobe evaluate <skill-path> --tasks 30 --repeats 2 --db outputs/evaluations.db
Add --llm-judge [--judge-model <model>] for pairwise judge scoring. The CLI uses whatever LLM provider the local runtime is configured with.
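If the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below only builds the command and does not run it. It uses the flags that appear in this document (`--tasks`, `--repeats`, `--db`, `--llm-judge`, `--judge-model`); the helper name `build_command` and its defaults are assumptions for illustration.

```python
import shlex
from typing import Optional


def build_command(
    skill_path: str,
    tasks: int = 30,
    repeats: int = 2,
    db: str = "outputs/evaluations.db",
    llm_judge: bool = False,
    judge_model: Optional[str] = None,
) -> list[str]:
    """Assemble a `skillprobe evaluate` argument list (hypothetical helper)."""
    if judge_model and not llm_judge:
        raise ValueError("--judge-model only applies together with --llm-judge")
    cmd = [
        "skillprobe", "evaluate", skill_path,
        "--tasks", str(tasks),
        "--repeats", str(repeats),
        "--db", db,
    ]
    if llm_judge:
        cmd.append("--llm-judge")
        if judge_model:
            cmd += ["--judge-model", judge_model]
    return cmd


if __name__ == "__main__":
    # shlex.join quotes arguments safely for display in a shell.
    print(shlex.join(build_command("path/to/skill", llm_judge=True)))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with skill paths that contain spaces.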
Reference Files
- DISPATCH_PROTOCOL.md: Three-role architecture, sub-agent prompt templates, dispatch constraints, evidence requirements
- SCORING_REFERENCE.md: Scoring layers, 6-dimension weights, derived metrics, recommendation thresholds, report format
Security & Privacy
Skill content and task prompts are sent only to the configured LLM provider. All evaluation data is stored locally. No telemetry.
SkillProbe
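A/B evaluate whether a skill actually helps, or just adds complexity.
Runs inside the current agent runtime (Cursor, OpenClaw, ClaudeCode). No extra API key required.
7-Step Workflow
Copy this checklist and track progress:
Evaluation Progress:
- [ ] Step 1: Profile the skill (read SKILL.md, extract domain/triggers/boundaries)
- [ ] Step 2: Design the evaluation plan (task categories, count, difficulty mix)
- [ ] Step 3: Generate test tasks (normal + edge + adversarial)
- [ ] Step 4: Dispatch baseline tasks to Sub-Agent A (no skill content!)
- [ ] Step 5: Dispatch with-skill tasks to Sub-Agent B (full skill included)
- [ ] Step 6: Score both runs (rules + results + optional LLM judge)
- [ ] Step 7: Attribute differences and generate the report
Steps 1-3 and 6-7: You (orchestrator) do these.
Steps 4-5: Dispatch to isolated sub-agents. NEVER execute tasks yourself.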
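Steps 1-3: Prepare (Orchestrator)
1. Profile: Read the target skill's SKILL.md. Extract the problem domain, trigger conditions, capabilities, and boundaries.
2. Design plan: Choose task categories (QA, retrieval, coding, analysis, etc.), task count, and difficulty distribution (easy 30% / medium 40% / hard 20% / edge 10%).
3. Generate tasks: Create diverse, self-contained test prompts. Do NOT mention the skill name or the A/B experiment in task prompts.
Steps 4-5: Dispatch (Three-Role Isolation)
Create two separate sub-agent sessions. See DISPATCH_PROTOCOL.md for exact prompt templates and constraints.
Key rules:
- Sub-Agent A (baseline): receives ONLY task prompts, zero skill content
- Sub-Agent B (with-skill): receives task prompts + full skill content
- Different session_id for each sub-agent
- Orchestrator never answers any test task
Steps 6-7: Score and Report (Orchestrator)
Collect outputs from both sub-agents. Score across 6 dimensions (100-point scale). See SCORING_REFERENCE.md for scoring layers, dimension weights, thresholds, and output format.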
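Principles
1. Three-role isolation: Orchestrator designs and scores. Sub-agents execute. Never mix.
2. Real execution only: No hypothetical or simulated outputs.
3. Evidence-backed scoring: Rules and results first; LLM judge optional.
4. Attribution over numbers: Explain WHY, not just how much.
5. Finish before claiming uncertainty: Inconclusive only after real attempted execution.
Standalone CLI (Optional)
For local runs outside an agent:
skillprobe evaluate <skill-path> --tasks 30 --repeats 2 --db outputs/evaluations.db
Add --llm-judge [--judge-model <model>] for pairwise judge scoring. The CLI uses whatever LLM provider the local runtime is configured with.
Reference Files
- DISPATCH_PROTOCOL.md: Prompt templates and dispatch constraints
- SCORING_REFERENCE.md: Scoring layers, dimension weights, thresholds, and output format
Security & Privacy
Skill content and task prompts are sent only to the configured LLM provider. All evaluation data is stored locally. No telemetry.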