Skill Optimizer (Autoresearch Loop + Anthropic Structure Audit)
Two-phase improvement system: (1) structural audit against Anthropic best practices, (2) iterative output quality loop.
Phase 1: Structure Audit (run first, always)
Before optimizing output quality, audit the skill's architecture. Score against these 5 structural checks:
Structural Checklist:
1. Gotchas section: Does SKILL.md have a "## Gotchas" section with at least one real failure case? (Highest-signal content per Anthropic)
2. Trigger-phrase description: Does the YAML description field say when to use the skill, not just what it does? Must include "Use when..." or equivalent trigger condition.
3. Progressive disclosure: Does the skill use the file system (references/, scripts/, assets/, config.json) instead of inline-dumping everything into SKILL.md?
4. Single focus: Does the skill fit cleanly into one type (Library Reference, Verification, Automation, Scaffolding, Runbook, etc.) without straddling multiple?
5. No railroading: Does the skill give Claude information + flexibility, rather than over-specifying how it must execute?
Score each: ✅ pass | ❌ fail | ⚠️ partial
For each failure: propose a concrete fix and apply if approved.
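Quick wins to apply immediately:
- If no Gotchas section → add a "## Gotchas" section with a placeholder bullet: "[Placeholder: add real failure cases here as they are discovered]"
- If description is a summary → rewrite as trigger condition
- If all content is inline → propose a references/ folder structure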
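Two of the structural checks (Gotchas section, trigger-phrase description) are mechanical enough to pre-screen with a script. The sketch below is illustrative only: `audit_skill_md` is a hypothetical helper, it assumes the description sits on a single `description:` line in the YAML front matter, and it cannot tell a real failure case from a placeholder bullet, so a human still has to confirm the result.

```python
def audit_skill_md(text: str) -> dict[str, bool]:
    """Pre-screen SKILL.md contents for two of the five structural checks."""
    has_gotcha_bullet = False
    in_gotchas = False
    description = ""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Track whether we are currently inside the Gotchas section.
            in_gotchas = stripped.lower() == "## gotchas"
        elif in_gotchas and stripped.startswith("- "):
            # Any bullet counts here; judging whether it is "real" is manual.
            has_gotcha_bullet = True
        elif stripped.lower().startswith("description:") and not description:
            # Assumption: single-line description in the YAML front matter.
            description = stripped[len("description:"):].strip()
    return {
        "gotchas_section": has_gotcha_bullet,
        "trigger_description": "use when" in description.lower(),
    }
```

Progressive disclosure, single focus, and railroading still need a judgment call and are left out on purpose.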
Phase 2: Output Quality Loop (autoresearch)
After structure audit, run the iterative improvement loop on the skill's actual outputs.
Setup
1. Which skill? User specifies, or infer from context.
2. Test inputs: Get 2-3 representative inputs. If the user doesn't provide them:
   - Check the skill's own docs for example usage
   - Use recent real invocations from memory/session history
   - For extraction skills: use known-good URLs/files. For generation skills: use the skill's own example prompts.
3. Scoring checklist: Build 3-6 scoring items. Start from the examples below, then customize:
   - What's the #1 thing that makes this skill's output bad? (That's checklist item 1)
   - What would make a user say "that's exactly what I wanted"? (That's the positive framing)
   - Add 1-2 items from the "Universal structural quality" list below
Scoring Checklist Examples
See references/checklist-examples.md for starter checklists by skill type (cold outreach, content, research, extraction, process/meta-skills).
Scoring Modes
Binary mode (default for simple skills): Yes/no per checklist item. Pass rate = total yes / (items × runs).
Dimensional mode (use for complex skills or when binary plateaus): Score each dimension 0-10. Identify the weakest dimension (lowest average across runs). Target that dimension for revision — do NOT rewrite everything.
Use dimensional mode when:
- Binary scoring hits 100% but output still feels mediocre
- The skill has qualitative dimensions (tone, depth, relevance) that binary can't capture
- You want to improve from "good" to "excellent" rather than from "broken" to "working"
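The two scoring modes reduce to small calculations. A minimal sketch, assuming each run is recorded as a list of yes/no answers (binary) or a dict of dimension scores (dimensional); the function names and data shapes are illustrative, not part of the skill:

```python
from statistics import mean


def binary_pass_rate(runs: list[list[bool]]) -> float:
    """Pass rate = total yes / (items x runs); each inner list is one run."""
    total_checks = sum(len(run) for run in runs)
    total_yes = sum(sum(run) for run in runs)
    return total_yes / total_checks


def weakest_dimension(runs: list[dict[str, float]]) -> tuple[str, float]:
    """Return the dimension with the lowest average score across runs."""
    averages = {dim: mean(run[dim] for run in runs) for dim in runs[0]}
    name = min(averages, key=averages.get)
    return name, averages[name]
```

For example, 5 checklist items over 2 runs with 4 "yes" answers in total gives a pass rate of 4/10 = 40%.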
The Loop
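Round N:
1. Run the skill on each test input
2. Score each output (binary: 1 point per yes | dimensional: 0-10 per dimension)
3. Compute scores:
   - Binary: pass rate = (total yes) / (items × runs)
   - Dimensional: average score per dimension across runs
4. Identify the weakest item/dimension (most failures or lowest average)
5. Make ONE targeted change to SKILL.md that addresses only that weakness
6. Re-run and re-score
7. If new score > old score: keep. Otherwise: revert.
8. Log: before/after scores, change made, dimension targeted, kept/reverted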
Stop when: binary ≥ 95% (3 consecutive rounds) OR dimensional weakest ≥ 8/10 (3 consecutive) OR 20 rounds reached.
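The stop conditions and the keep-or-revert decision can be expressed as two small predicates. This is a sketch under stated assumptions: `history` holds one score per completed round (pass rate in 0-1 for binary mode, weakest-dimension average in 0-10 for dimensional mode), a change is kept only on a strict improvement, and both function names are hypothetical.

```python
def should_stop(history: list[float], mode: str, max_rounds: int = 20) -> bool:
    """True once the round cap is hit or the last 3 rounds all meet the bar."""
    threshold = 0.95 if mode == "binary" else 8.0
    if len(history) >= max_rounds:
        return True
    last_three = history[-3:]
    return len(last_three) == 3 and all(score >= threshold for score in last_three)


def keep_or_revert(old_score: float, new_score: float) -> str:
    """Keep a change only if the score strictly improved; ties are reverted."""
    return "KEPT" if new_score > old_score else "REVERTED"
```

Requiring three consecutive rounds guards against stopping on a single lucky run.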
Output Files
- skills/{skill-name}/SKILL-optimized.md: improved version (original untouched)
- skills/{skill-name}/optimization-changelog.md: full round log
Changelog Format
## Structural Audit
- Gotchas section: ❌ → Added placeholder
- Description: ❌ → Rewritten as trigger condition
- Progressive disclosure: ⚠️ → Noted, deferred
## Round 1 (binary mode)
- Score: 4/10 (40%)
- Weakest item: "Does it mention business name?"
- Change: Added rule "Always open with [Business Name],"
- New score: 7/10 (70%)
- Decision: KEPT
## Round 2 (dimensional mode)
- Scores: Accuracy 8/10 | Tone 5/10 | Brevity 9/10 | Relevance 7/10
- Weakest dimension: Tone (5/10)
- Change: Added "Match prospect's industry language, not generic sales speak"
- New scores: Accuracy 8/10 | Tone 7/10 | Brevity 9/10 | Relevance 7/10
- Decision: KEPT (Tone +2)
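If the changelog is written by a script rather than by hand, a binary-mode round entry in the format above could be rendered like this. The helper name and its parameters are hypothetical, chosen only to mirror the fields shown in the example entry.

```python
def format_binary_round(round_no: int, old_yes: int, new_yes: int, total: int,
                        weakest_item: str, change: str, kept: bool) -> str:
    """Render one binary-mode round in the changelog format shown above."""
    def score(yes: int) -> str:
        # e.g. 4 of 10 -> "4/10 (40%)"
        return f"{yes}/{total} ({yes / total:.0%})"

    return "\n".join([
        f"## Round {round_no} (binary mode)",
        f"- Score: {score(old_yes)}",
        f'- Weakest item: "{weakest_item}"',
        f"- Change: {change}",
        f"- New score: {score(new_yes)}",
        f"- Decision: {'KEPT' if kept else 'REVERTED'}",
    ])
```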
Optimizing Meta-Skills (Process Skills)
Some skills don't produce text — they drive a process (e.g., this skill itself, planning workflows, research pipelines). For these:
What to score: Score the experience of following the process, not a text artifact.
- Did the process produce a clear result?
- Were there moments of confusion where the instructions were ambiguous?
- Did any step feel unnecessary or redundant?
- Could someone follow this without prior context?
How to test: Run the skill on 2-3 real tasks (not hypothetical). Score after each real use. The test inputs ARE the tasks you're applying the skill to.
Dimensional scoring for process skills:
- Clarity: Can I follow each step without re-reading?
- Completeness: Does the process cover the full workflow?
- Actionability: Do I know exactly what to do at each step, or do I have to infer?
- Efficiency: Are there wasted/redundant steps?
- Self-applicability: Can this process improve itself? (Meta-test)
Checklist Sweet Spot
- 3-6 questions = optimal
- Too few: not granular enough to guide changes
- Too many: skill starts gaming the checklist (like a student memorizing answers without understanding)
When to Use
- Before running any skill at scale (cold outreach, content generation, scraping)
- After a new model upgrade — re-validate existing skills
- When a skill has inconsistent output quality
- Monthly maintenance pass on high-use skills
- Immediately after creating a new skill (structural audit only takes 5 min)
When to Run Which Phase
- Any new skill → Structure audit (5 min, catches issues early)
- Before scale use → Output loop (validate quality before mass runs)
- After model upgrade → Output loop (re-validate existing skills)
- Inconsistent output → Output loop (find the failing item/dimension)
- High-revenue skills → Both phases (cold outreach, content gen — quality variance = revenue impact)
Gotchas
- The output loop requires skills that produce scoreable text outputs. Scripts and tools that produce side effects need a different verification approach (use a Product Verification skill type instead).
- Don't run the output loop on skills that call expensive APIs without rate-limit awareness: each round runs the skill multiple times.
- Phase 1 (structure audit) should always run before Phase 2, because fixing structure first makes the output loop more effective.
- 3-6 checklist questions is the sweet spot. With more than 6, the skill starts gaming individual checks rather than improving overall quality.
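To make the API-cost gotcha concrete, a rough upper bound on skill invocations can be computed before starting. This is a back-of-the-envelope sketch that assumes every round runs the skill once per test input before the change and once after it; an implementation that reuses the previous round's scores as the baseline would need fewer calls. The function name is hypothetical.

```python
def max_skill_invocations(test_inputs: int, max_rounds: int = 20,
                          runs_per_round: int = 2) -> int:
    """Upper bound on skill runs: inputs x runs per round x rounds."""
    return test_inputs * runs_per_round * max_rounds
```

With 3 test inputs and the 20-round cap, that is up to 3 × 2 × 20 = 120 invocations, which is worth checking against the API's rate limits and pricing first.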