Skill Optimizer (Autoresearch Loop + Anthropic Structure Audit)

Two-phase improvement system: (1) structural audit against Anthropic best practices, (2) iterative output quality loop.

Phase 1: Structure Audit (run first, always)

Before optimizing output quality, audit the skill's architecture. Score against these 5 structural checks:

Structural Checklist:

1. Gotchas section — Does SKILL.md have a ## Gotchas section with at least one real failure case? (Highest-signal content per Anthropic)
2. Trigger-phrase description — Does the YAML description field say when to use the skill, not just what it does? Must include "Use when..." or equivalent trigger condition.
3. Progressive disclosure — Does the skill use the file system (references/, scripts/, assets/, config.json) instead of inline-dumping everything into SKILL.md?
4. Single focus — Does the skill fit cleanly into one type (Library Reference, Verification, Automation, Scaffolding, Runbook, etc.) without straddling multiple?
5. No railroading — Does the skill give Claude information + flexibility, rather than over-specifying how it must execute?

Score each: ✅ pass | ❌ fail | ⚠️ partial

For each failure: propose a concrete fix and apply if approved.

Quick wins to apply immediately:

- If no Gotchas section → add ## Gotchas\n- [Placeholder: add real failure cases here as they are discovered]
- If description is a summary → rewrite as trigger condition
- If all content is inline → propose a references/ folder structure
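Two of the five checks (Gotchas section, trigger-phrase description) lend themselves to a rough automated pre-check. The sketch below is a heuristic only, not part of the skill itself: the function name audit_skill_md is hypothetical, and it assumes SKILL.md opens with a `---`-delimited frontmatter block whose description fits on one line. The other three checks still need judgment.

```python
import re


def audit_skill_md(text: str) -> dict[str, bool]:
    """Heuristic pass/fail for the Gotchas and trigger-description checks."""
    # Assumption: frontmatter is a '---'-delimited block at the very top.
    fm = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, flags=re.DOTALL)
    frontmatter = fm.group(1) if fm else ""

    # Trigger check: the description line should contain "use when".
    desc = re.search(r"^description:\s*(.+)$", frontmatter, flags=re.MULTILINE)
    has_trigger = bool(desc and "use when" in desc.group(1).lower())

    # Gotchas check: a "## Gotchas" heading with at least one bullet
    # that is not just the placeholder line.
    sec = re.search(
        r"^## Gotchas\s*\n(.*?)(?=^## |\Z)", text, flags=re.DOTALL | re.MULTILINE
    )
    bullets = (
        re.findall(r"^\s*[-*] (.+)$", sec.group(1), flags=re.MULTILINE) if sec else []
    )
    real = [b for b in bullets if not b.lower().startswith("[placeholder")]

    return {"gotchas_section": bool(real), "trigger_description": has_trigger}
```

A False result here maps to ❌ in the audit; the script cannot distinguish ⚠️ partial, so treat it as a first filter rather than the final score.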

Phase 2: Output Quality Loop (autoresearch)

After structure audit, run the iterative improvement loop on the skill's actual outputs.

Setup

1. Which skill? — User specifies, or infer from context.
2. Test inputs — Get 2-3 representative inputs. If the user doesn't provide them:

   - Check the skill's own docs for example usage
   - Use recent real invocations from memory/session history
   - For extraction skills: use known-good URLs/files. For generation skills: use the skill's own example prompts.

3. Scoring checklist — Build 3-6 scoring items. Start from the examples below, then customize:

   - What's the #1 thing that makes this skill's output bad? (That's checklist item 1)
   - What would make a user say "that's exactly what I wanted"? (That's the positive framing)
   - Add 1-2 items from the "Universal structural quality" list below

Scoring Checklist Examples

See references/checklist-examples.md for starter checklists by skill type (cold outreach, content, research, extraction, process/meta-skills).

Scoring Modes

Binary mode (default for simple skills): Yes/no per checklist item. Pass rate = total yes / (items × runs).

Dimensional mode (use for complex skills or when binary plateaus): Score each dimension 0-10. Identify the weakest dimension (lowest average across runs). Target that dimension for revision — do NOT rewrite everything.

Use dimensional mode when:

- Binary scoring hits 100% but output still feels mediocre
- The skill has qualitative dimensions (tone, depth, relevance) that binary can't capture
- You want to improve from "good" to "excellent" rather than from "broken" to "working"
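The two scoring modes reduce to small calculations. The following is a minimal sketch, assuming scores are collected as plain Python lists and dicts; the function names and data shapes are illustrative, not something the skill prescribes.

```python
from statistics import mean


def binary_pass_rate(runs: list[list[bool]]) -> float:
    """runs[i][j] is True when run i passed checklist item j."""
    total_yes = sum(sum(run) for run in runs)
    total_checks = sum(len(run) for run in runs)  # items x runs
    return total_yes / total_checks


def weakest_dimension(runs: list[dict[str, float]]) -> tuple[str, float]:
    """Return the dimension with the lowest average score across runs."""
    averages = {dim: mean(run[dim] for run in runs) for dim in runs[0]}
    name = min(averages, key=averages.get)
    return name, averages[name]


# Two runs against a five-item checklist: 4 yes out of 10 checks.
binary_runs = [
    [True, False, True, False, False],
    [True, False, False, True, False],
]
print(binary_pass_rate(binary_runs))

# Two runs scored 0-10 on four dimensions.
dimensional_runs = [
    {"Accuracy": 8, "Tone": 4, "Brevity": 9, "Relevance": 7},
    {"Accuracy": 8, "Tone": 6, "Brevity": 9, "Relevance": 7},
]
print(weakest_dimension(dimensional_runs))
```

Averaging per dimension before picking the minimum keeps one unlucky run from steering the next revision.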

The Loop

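Round N:

1. Run the skill on each test input
2. Score each output (binary: 1 point per yes | dimensional: 0-10 per dimension)
3. Compute the score:
   - Binary: pass rate = (total yes) / (items × runs)
   - Dimensional: average per dimension across runs
4. Identify the weakest item/dimension (most failures or lowest average)
5. Make ONE targeted change to SKILL.md that addresses only that weakness
6. Re-run and re-score
7. If new score > old score: KEEP. Otherwise: REVERT.
8. Log: before/after scores, change made, dimension targeted, kept/reverted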

Stop when: binary ≥ 95% (3 consecutive rounds) OR dimensional weakest ≥ 8/10 (3 consecutive) OR 20 rounds reached.
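The keep-or-revert rule and the stop conditions can be sketched as a small driver loop. This is an illustration under stated assumptions, not the skill's implementation: optimize, score, and propose_change are hypothetical names, score stands in for running the skill on all test inputs and computing one number, and "3 consecutive rounds" is read as three rounds in a row in which the current best score meets the target.

```python
from typing import Callable


def optimize(
    skill_text: str,
    score: Callable[[str], float],
    propose_change: Callable[[str], str],
    target: float = 0.95,      # use 8.0 when scoring the weakest dimension
    streak_needed: int = 3,
    max_rounds: int = 20,
) -> tuple[str, list[dict]]:
    best_text, best_score = skill_text, score(skill_text)
    streak, log = 0, []
    for round_no in range(1, max_rounds + 1):
        candidate = propose_change(best_text)  # one targeted change
        new_score = score(candidate)           # re-run and re-score
        kept = new_score > best_score
        log.append({
            "round": round_no,
            "before": best_score,
            "after": new_score,
            "decision": "KEPT" if kept else "REVERTED",
        })
        if kept:
            best_text, best_score = candidate, new_score
        # Count consecutive rounds at or above the target.
        streak = streak + 1 if best_score >= target else 0
        if streak >= streak_needed:
            break
    return best_text, log
```

Because a change is kept only on a strict improvement, a reverted round leaves the previous best text in place, so the original SKILL.md content is never overwritten by a worse variant.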

Output Files

- skills/{skill-name}/SKILL-optimized.md — improved version (original untouched)
- skills/{skill-name}/optimization-changelog.md — full round log

Changelog Format

## Structural Audit
- Gotchas section: ❌ → Added placeholder
- Description: ❌ → Rewritten as trigger condition
- Progressive disclosure: ⚠️ → Noted, deferred

## Round 1 (binary mode)
- Score: 4/10 (40%)
- Weakest item: "Does it mention business name?"
- Change: Added rule "Always open with [Business Name],"
- New score: 7/10 (70%)
- Decision: KEPT

## Round 2 (dimensional mode)
- Scores: Accuracy 8/10 | Tone 5/10 | Brevity 9/10 | Relevance 7/10
- Weakest dimension: Tone (5/10)
- Change: Added "Match prospect's industry language, not generic sales speak"
- New scores: Accuracy 8/10 | Tone 7/10 | Brevity 9/10 | Relevance 7/10
- Decision: KEPT (Tone +2)
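A binary-mode round entry in this format can be generated rather than typed by hand. The helper below is a hypothetical sketch (format_binary_round is not a name from the skill); it assumes scores are tracked as counts of passed checks out of a fixed total.

```python
def format_binary_round(
    round_no: int,
    passed_before: int,
    passed_after: int,
    total: int,
    weakest_item: str,
    change: str,
) -> str:
    """Render one binary-mode changelog entry."""
    decision = "KEPT" if passed_after > passed_before else "REVERTED"
    lines = [
        f"## Round {round_no} (binary mode)",
        f"- Score: {passed_before}/{total} ({passed_before / total:.0%})",
        f'- Weakest item: "{weakest_item}"',
        f"- Change: {change}",
        f"- New score: {passed_after}/{total} ({passed_after / total:.0%})",
        f"- Decision: {decision}",
    ]
    return "\n".join(lines)


print(format_binary_round(
    1, 4, 7, 10,
    "Does it mention business name?",
    'Added rule "Always open with [Business Name],"',
))
```

Deriving the decision from the two scores keeps the log consistent with the keep-only-if-improved rule.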

Optimizing Meta-Skills (Process Skills)

Some skills don't produce text — they drive a process (e.g., this skill itself, planning workflows, research pipelines). For these:

What to score: Score the experience of following the process, not a text artifact.

- Did the process produce a clear result?
- Were there moments of confusion where the instructions were ambiguous?
- Did any step feel unnecessary or redundant?
- Could someone follow this without prior context?

How to test: Run the skill on 2-3 real tasks (not hypothetical). Score after each real use. The test inputs ARE the tasks you're applying the skill to.

Dimensional scoring for process skills:

- Clarity — Can I follow each step without re-reading?
- Completeness — Does the process cover the full workflow?
- Actionability — Do I know exactly what to do at each step, or do I have to infer?
- Efficiency — Are there wasted/redundant steps?
- Self-applicability — Can this process improve itself? (Meta-test)

Checklist Sweet Spot

- 3-6 questions = optimal
- Too few: not granular enough to guide changes
- Too many: skill starts gaming the checklist (like a student memorizing answers without understanding)

When to Use

- Before running any skill at scale (cold outreach, content generation, scraping)
- After a new model upgrade — re-validate existing skills
- When a skill has inconsistent output quality
- Monthly maintenance pass on high-use skills
- Immediately after creating a new skill (structural audit only takes 5 min)

When to Run Which Phase

- Any new skill → Structure audit (5 min, catches issues early)
- Before scale use → Output loop (validate quality before mass runs)
- After model upgrade → Output loop (re-validate existing skills)
- Inconsistent output → Output loop (find the failing item/dimension)
- High-revenue skills → Both phases (cold outreach, content gen — quality variance = revenue impact)

Gotchas

- Output loop requires skills that produce scoreable text outputs — scripts/tools that produce side effects need a different verification approach (use a Product Verification skill type instead)
- Don't run output loop on skills that call expensive APIs without rate limit awareness — each round runs the skill multiple times
- Phase 1 (structure audit) should always run before Phase 2 — fixing structure first makes the output loop more effective
- 3-6 checklist questions is the sweet spot — more than 6 and the skill starts gaming individual checks rather than improving overall quality
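To make the API-cost gotcha concrete, here is a back-of-the-envelope estimate of how many skill invocations a full loop could trigger. It assumes each round runs every test input twice (once before the change, once after); that multiplier and the function name are assumptions for illustration, and reusing the previous round's re-run as the next baseline would roughly halve the count.

```python
def worst_case_runs(
    test_inputs: int,
    max_rounds: int = 20,
    runs_per_input_per_round: int = 2,  # assumption: baseline run + re-run
) -> int:
    """Upper bound on skill invocations if the loop never stops early."""
    return test_inputs * max_rounds * runs_per_input_per_round


# Three test inputs, the 20-round cap, two runs per input per round.
print(worst_case_runs(3))
```

Multiplying the result by the per-call price of the API the skill uses gives a ceiling to check against rate limits and budget before starting the loop.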

