Test Skills Safely
Two use cases:
1. Try before commit: test-drive skills before installing
2. Evaluate before publish: verify quality before publishing
Key principle: Test in isolation. Never affect the user's environment.
References:
- Read sandbox.md: isolated testing environment
- Read compare.md: A/B comparison between skills
- Read evaluate.md: multi-agent quality evaluation
Quick Start
Trial a skill:
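sessions_spawn(
    task="Test skill X: load only its SKILL.md, run [sample task], report quality",
    model="anthropic/claude-haiku"
)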
Compare two skills:
1. Run the same task through each skill (separate sub-agents)
2. Present the outputs side by side
3. Ask: "Which works better? Why?"
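The comparison steps can be sketched in Python. This is a minimal illustration, not part of the skill itself: `compare_skills` and `fake_run` are hypothetical names, and `run` stands in for whatever actually spawns an isolated sub-agent per skill.

```python
from typing import Callable, Dict


def compare_skills(task: str, skills: Dict[str, str],
                   run: Callable[[str, str], str]) -> str:
    """Run one task through each skill separately and collect the outputs.

    `run(skill_content, task)` is a hypothetical callable that should spawn
    a fresh sub-agent per call and return its output.
    """
    sections = []
    for name, content in skills.items():
        # One call per skill, so no skill sees another skill's context.
        sections.append(f"=== {name} ===\n{run(content, task)}")
    sections.append("Which works better? Why?")
    return "\n\n".join(sections)


def fake_run(skill_content: str, task: str) -> str:
    # Demonstration stub only: a real runner would spawn a sub-agent here.
    return f"[{len(skill_content)} chars of skill] handled: {task}"


report = compare_skills(
    "summarize a changelog",
    {"skill-a": "content of A", "skill-b": "content of B"},
    fake_run,
)
print(report)
```

Passing the runner in as a parameter keeps the comparison logic independent of how sub-agents are actually spawned.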
Test Modes
Trial Mode — Before installing
- Spawn a sub-agent with ONLY the test skill
- Run 2-3 representative tasks
- Evaluate: Does it help? Are the instructions clear?
- Decide: keep, pass, or try another
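One way to record a trial run is sketched below. `TrialResult` and `summarize_trial` are hypothetical names introduced for illustration, and the sample tasks are made up. The sketch only tallies the two evaluation questions and leaves the final decision to the user.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrialResult:
    task: str
    helped: bool              # Did the skill help on this task?
    clear_instructions: bool  # Were its instructions clear?


def summarize_trial(results: List[TrialResult]) -> str:
    """Tally trial results; the keep/pass/try-another call stays with the user."""
    total = len(results)
    helped = sum(r.helped for r in results)
    clear = sum(r.clear_instructions for r in results)
    return (f"{helped}/{total} tasks helped, "
            f"{clear}/{total} had clear instructions. "
            "Decision: keep, pass, or try another?")


results = [
    TrialResult("format a README", helped=True, clear_instructions=True),
    TrialResult("write release notes", helped=True, clear_instructions=True),
    TrialResult("draft a commit message", helped=False, clear_instructions=True),
]
summary = summarize_trial(results)
print(summary)
```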
Evaluation Mode — Before publishing
- Spawn specialized reviewers (see evaluate.md)
- Check structure, safety, usefulness
- Synthesize findings
- Recommend improvements
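The synthesis step might look like the following sketch, which assumes each reviewer returns `(area, note)` pairs. That shape, the `synthesize` helper, and the sample findings are illustrative assumptions; evaluate.md may define a different format.

```python
from typing import Dict, List, Tuple

AREAS = ("structure", "safety", "usefulness")


def synthesize(findings: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (area, note) pairs by review area, skipping repeated notes."""
    grouped: Dict[str, List[str]] = {area: [] for area in AREAS}
    for area, note in findings:
        if area not in grouped:
            raise ValueError(f"unknown review area: {area}")
        if note not in grouped[area]:
            grouped[area].append(note)
    return grouped


findings = [
    ("structure", "Quick Start is missing an example"),
    ("safety", "Asks for credentials without saying why"),
    ("structure", "Quick Start is missing an example"),  # repeated by a second reviewer
]
grouped = synthesize(findings)
for area in AREAS:
    print(f"{area}: {grouped[area] or 'no findings'}")
```

Grouping by area before recommending improvements makes it easy to see which of the three checks turned up nothing.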
Sandbox Isolation
⚠️ Never load test skill into your main context.
Sub-agent approach (recommended):
sessions_spawn(
    task="You have ONE skill loaded: [skill content]. Test by doing: [task]",
    model="anthropic/claude-haiku"
)
- Complete isolation: the main session is unaffected
- Natural cleanup: the sub-agent terminates and you are done
- Cheap testing: use Haiku
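The arguments for the spawn call above can be assembled by a small helper, sketched here in Python. `build_spawn_args` is a hypothetical name. It only builds the `task` and `model` values shown in the call and does not invoke any real API.

```python
from typing import Dict


def build_spawn_args(skill_content: str, task: str,
                     model: str = "anthropic/claude-haiku") -> Dict[str, str]:
    """Assemble the task= and model= values for a sub-agent spawn call."""
    if not skill_content.strip():
        # Refuse to build a test prompt around an empty skill.
        raise ValueError("skill content is empty; nothing to test")
    prompt = f"You have ONE skill loaded: {skill_content}. Test by doing: {task}"
    return {"task": prompt, "model": model}


args = build_spawn_args("[skill content]", "[task]")
print(args["task"])
print(args["model"])
```

Keeping the prompt template in one place means every test run states the single-skill constraint in the same way.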
What to check:
- Does it activate correctly?
- Are the instructions clear?
- Is the token cost reasonable?
- Is the output quality acceptable?
Edge Cases
Skill requires credentials: Ask user for test credentials or skip auth-dependent features.
Skill not found: Verify slug with npx clawhub info <slug> before testing.
Test fails mid-way: Sub-agent terminates cleanly. Review logs, adjust test task, retry.
Skill has many auxiliary files: Load SKILL.md first, reference others only if needed during test.
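For the many-files case, a loader could read SKILL.md and only list the remaining files, as in this sketch. It assumes the skill is a local directory with SKILL.md at its root. `load_skill_entrypoint` is a hypothetical helper, and the temporary directory exists only for demonstration.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def load_skill_entrypoint(skill_dir: Path) -> Tuple[str, List[str]]:
    """Read SKILL.md and list, without reading, the other files in the skill."""
    entry = skill_dir / "SKILL.md"
    if not entry.is_file():
        raise FileNotFoundError(f"no SKILL.md in {skill_dir}")
    others = sorted(
        p.relative_to(skill_dir).as_posix()
        for p in skill_dir.rglob("*")
        if p.is_file() and p != entry
    )
    return entry.read_text(encoding="utf-8"), others


# Demonstration with a throwaway directory standing in for a skill.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "SKILL.md").write_text("Test Skills Safely", encoding="utf-8")
    (root / "sandbox.md").write_text("details", encoding="utf-8")
    content, others = load_skill_entrypoint(root)

print(content)
print(others)
```

The auxiliary files are returned as names only, so the test prompt can pull one in later if the task turns out to need it.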
Test thoroughly. Install only after explicit user approval.