Skill Test
A skill for auditing any Agent Skill against the official Agent Skills specification and best practices from the anthropics/skills repository.
Language Detection
Detect the language of the user's request and generate the entire report in that language:
- If the user writes in Chinese (e.g., "检查我的skill", "给这个skill打分", "这个skill符合规范吗"): generate the full report in Chinese, including all section headings, findings, improvement suggestions, and the quick fix checklist.
- If the user writes in English (e.g., "check my skill", "score this skill", "does this follow the spec"): generate the full report in English.
- If the language is ambiguous or mixed: default to English, but add a note at the top of the report: "Note: Report generated in English. Reply in Chinese if you'd prefer a Chinese report."
Apply this language choice consistently throughout the entire report — do not mix languages within a single report.
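The language rule above could be approximated with a character-count heuristic. This is only a sketch: the function name `detect_report_language` and the rule "Chinese characters must outnumber Latin words" are assumptions chosen so that requests like "检查我的skill" still count as Chinese; a real implementation may rely on the model's own judgment instead.

```python
import re

# Basic CJK Unified Ideographs block; an assumed proxy for "written in Chinese".
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]+")


def detect_report_language(request):
    """Return "zh", "en", or "ambiguous" (ambiguous means: default to English plus a note)."""
    cjk_chars = len(CJK_CHAR.findall(request))
    latin_words = len(LATIN_WORD.findall(request))
    if cjk_chars == 0 and latin_words > 0:
        return "en"
    if cjk_chars > latin_words:
        # An embedded term such as "skill" should not flip a Chinese request.
        return "zh"
    return "ambiguous"
```

When the result is "ambiguous", the report would be written in English with the note described above at the top.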
What This Skill Does
Given a path to a skill directory or SKILL.md file, you will:
1. Read every file in the skill directory (SKILL.md, scripts/, references/, assets/, and any other files)
2. Check each file and each line against the official spec rules
3. Score the skill across six dimensions (total 100 points)
4. Generate a detailed Markdown report (in the user's language) with findings and prioritized improvement suggestions
Validation Process
Step 1: Discover the Skill Structure
First, understand what you're looking at. The user may provide:
- A path to a skill directory (e.g., /path/to/my-skill/)
- A path to a SKILL.md file (e.g., /path/to/my-skill/SKILL.md)
- A GitHub URL (e.g., https://github.com/user/repo/tree/main/skills/my-skill)
- A description of their skill in the conversation
For local paths: list all files in the directory recursively. For GitHub URLs: fetch the directory listing and then each file. For conversation-provided content: work with what's given.
Record:
- Directory name (the folder containing SKILL.md)
- All files present and their sizes/line counts
- Whether SKILL.md exists
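For a local path, the three items to record could be gathered as in the sketch below. The function name `inventory_skill` and the shape of the returned dictionary are illustrative assumptions, not part of the spec; files that do not decode as UTF-8 are treated as binary and get no line count.

```python
from pathlib import Path


def inventory_skill(skill_dir):
    """Collect the directory name, the SKILL.md presence flag, and per-file sizes."""
    root = Path(skill_dir)
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        data = path.read_bytes()
        try:
            line_count = len(data.decode("utf-8").splitlines())
        except UnicodeDecodeError:
            line_count = None  # assumed binary: record the size only
        files.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": len(data),
            "lines": line_count,
        })
    return {
        "directory_name": root.resolve().name,
        "has_skill_md": (root / "SKILL.md").is_file(),
        "files": files,
    }
```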
Step 2: Read Every File
Read the complete content of:
1. SKILL.md: always required, read in full
2. Every file in scripts/: read each one
3. Every file in references/: read each one
4. Every file in assets/: note what's there (may not need to read binary files)
5. Any other files at the root level
Do not skip files. The goal is a thorough audit, not a surface-level check.
Step 3: Score Each Dimension
Load the detailed scoring rubric from references/scoring-rubric.md before scoring. Apply every rule carefully and record specific evidence for each finding (quote the exact line or value that passes or fails).
Score all six dimensions:
Dimension 1 — Directory Structure (10 points)
Dimension 2 — Frontmatter Compliance (30 points)
Dimension 3 — Body Content Quality (25 points)
Dimension 4 — Progressive Disclosure Design (15 points)
Dimension 5 — Optional Directory Quality (10 points)
Dimension 6 — Description Trigger Optimization (10 points)
For each finding, classify it as:
- ✅ Pass: fully compliant
- ⚠️ Warning: not a hard rule violation but suboptimal
- ❌ Fail: violates the spec or a critical best practice
Step 4: Generate the Report
Output the complete Markdown report using the template in the Report Format section below. Be specific: quote actual values, line numbers, and file names. Do not write vague feedback like "description could be better" — write "The description is only 12 characters ('Helps with X'), which is too vague. It must describe both what the skill does AND when to use it, with specific trigger keywords."
Scoring Dimensions
Dimension 1: Directory Structure (10 points)
Check the following, reading references/spec-summary.md for the exact rules:
| Check | Points | Rule |
|---|---|---|
| SKILL.md exists in the skill root | 4 | Required by spec |
| Directory name matches name frontmatter field | 3 | Spec requires exact match |
| Optional dirs used appropriately for their defined purpose | 2 | Each dir has a defined purpose |
| No unexpected files that violate the "principle of least surprise" | 1 | No malware, exploit code, or misleading content |
Standard directories — these are all legitimate and should not be flagged as unexpected:
- scripts/: executable code for deterministic/repetitive tasks
- references/: docs loaded into context as needed
- assets/: files used in output (templates, icons, fonts)
- evals/: test cases and evaluation data (used by skill-creator workflow)
- agents/: instructions for specialized subagents (used by complex skills like skill-creator)
Scoring guidance:
- If SKILL.md is missing: dimension score = 0, stop checking this dimension
- If directory name doesn't match name: -3 points
- If optional dirs contain content that doesn't match their purpose (e.g., documentation in scripts/): -1 per violation
- Do not penalize evals/ or agents/ directories; they are standard and expected
Dimension 2: Frontmatter Compliance (30 points)
Parse the YAML frontmatter block (between the --- delimiters) and check every field.
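As a rough illustration of this parsing step, the sketch below splits a SKILL.md string into frontmatter fields and body using only the standard library. It is an assumption-laden simplification: `split_frontmatter` is a hypothetical helper, and it only understands flat `key: value` lines, so nested fields such as metadata would need a real YAML parser.

```python
def split_frontmatter(text):
    """Return (fields, body) for a SKILL.md string; raise ValueError if malformed."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md does not start with a '---' delimiter")
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        raise ValueError("frontmatter has no closing '---' delimiter")
    fields = {}
    for line in lines[1:end]:
        # Flat top-level pairs only; indented or comment lines are skipped.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("'\"")
    body = "\n".join(lines[end + 1:])
    return fields, body
```

A `ValueError` here corresponds to the malformed-frontmatter case, which the audit would report as a critical failure while still scoring whatever can be read.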
name field (10 points)
| Check | Points |
|---|---|
| Field exists | 3 |
| Length is 1–64 characters | 2 |
| Contains only lowercase letters (a-z), digits (0-9), and hyphens (-) | 2 |
| Does not start or end with a hyphen | 1 |
| Does not contain consecutive hyphens (--) | 1 |
| Matches the parent directory name exactly | 1 |
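The name checks in this table are mechanical, so they can be sketched directly. The helper name `check_name` and the failure messages are illustrative assumptions; the point values are the ones from the table.

```python
import re


def check_name(name, directory_name):
    """Score the name field out of 10 and list the checks that failed."""
    if not name:
        return 0, ["name field is missing"]
    points = 3  # field exists
    failures = []
    checks = [
        (2, 1 <= len(name) <= 64, "length must be 1-64 characters"),
        (2, re.fullmatch(r"[a-z0-9-]+", name) is not None,
         "only a-z, 0-9 and '-' are allowed"),
        (1, not (name.startswith("-") or name.endswith("-")),
         "must not start or end with a hyphen"),
        (1, "--" not in name, "must not contain consecutive hyphens"),
        (1, name == directory_name, "must match the parent directory name"),
    ]
    for value, passed, message in checks:
        if passed:
            points += value
        else:
            failures.append(message)
    return points, failures
```

For example, `check_name("my-skill", "my-skill")` earns all 10 points, while a name like `My--Skill-` keeps only the points for existing and for its length.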
description field (10 points)
| Check | Points |
|---|---|
| Field exists | 3 |
| Length is 1–1024 characters | 2 |
| Describes WHAT the skill does (not just a label) | 2 |
| Describes WHEN to use it (trigger conditions, use cases) | 2 |
| Contains specific trigger keywords (not just generic terms) | 1 |
Warning (not a point deduction, but flag it): If description is under 50 characters, it's almost certainly too vague. If it's over 800 characters, it may be too long for efficient triggering.
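The length rule and the two length warnings can be expressed as a small check. This is a sketch with a hypothetical helper name, `description_length_findings`; the what/when/keyword checks still need a human or model reading of the text.

```python
def description_length_findings(description):
    """Return length-related findings; an empty list means nothing to flag."""
    length = len(description)
    findings = []
    if not 1 <= length <= 1024:
        findings.append(f"FAIL: length {length} is outside 1-1024 characters")
    if length < 50:
        findings.append(f"WARNING: only {length} characters, almost certainly too vague")
    elif length > 800:
        findings.append(f"WARNING: {length} characters may be too long for efficient triggering")
    return findings
```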
license field (3 points, optional)
| Check | Points |
|---|---|
| If present: value is a recognizable license name or references a bundled file | 2 |
| If present: format is concise (not a full license text inline) | 1 |
If absent: award 3 points (it's optional, absence is fine).
compatibility field (3 points, optional)
| Check | Points |
|---|---|
| If present: length is 1–500 characters | 1 |
| If present: describes actual environment requirements (not just "works everywhere") | 2 |
If absent: award 3 points (most skills don't need it).
metadata field (2 points, optional)
| Check | Points |
|---|---|
| If present: is a valid key-value mapping (not a list or scalar) | 1 |
| If present: keys are reasonably unique/namespaced | 1 |
If absent: award 2 points.
allowed-tools field (2 points, optional)
| Check | Points |
|---|---|
| If present: is a space-delimited list of tool names | 1 |
| If present: tool names follow expected format (e.g., Bash(git:*), Read) | 1 |
If absent: award 2 points.
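The allowed-tools checks could look like the sketch below. The regular expression is an assumption inferred from the two examples given (`Bash(git:*)` and `Read`); the actual tool-name grammar may be broader, so treat mismatches as something to review, not as proof of an error.

```python
import re

# Assumed shape: a bare name (Read) or a name with a parenthesised pattern (Bash(git:*)).
TOOL_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*(?:\([^()\s]+\))?")


def check_allowed_tools(value):
    """Score allowed-tools out of 2 and return the entries that look malformed."""
    if value is None:
        return 2, []  # field absent: full points
    if not isinstance(value, str) or not value.split():
        return 0, [repr(value)]  # not a space-delimited string
    malformed = [tool for tool in value.split() if not TOOL_PATTERN.fullmatch(tool)]
    return (1 if malformed else 2), malformed
```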
Dimension 3: Body Content Quality (25 points)
Read the full Markdown body (everything after the frontmatter ---).
| Check | Points | Guidance |
|---|---|---|
| Body has substantive content (not empty or just a title) | 5 | At least a few paragraphs of real instructions |
| Body is under 500 lines | 5 | Under 500 lines = full score; 500–600 = -2; 600–800 = -4; 800+ = 0 |
| Includes step-by-step instructions or a clear workflow | 5 | Not just a description of what the skill is |
| Uses imperative form ("Do X", "Run Y", "Check Z") | 3 | Passive or descriptive writing is weaker |
| Explains the "why" behind key instructions | 3 | Not just "MUST do X" but "Do X because Y" |
| Defines output format clearly (template, example, or schema) | 4 | User should know exactly what to expect |
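The line-count row of the table maps to a simple step function. In this sketch the boundary handling is an assumption, because the stated ranges overlap at 600 and 800: exactly 500 to 600 lines loses 2 points, 601 to 800 loses 4, and anything longer scores 0. `body_length_score` is a hypothetical helper name.

```python
def body_length_score(body):
    """Return the 0-5 score for the 'body is under 500 lines' check."""
    line_count = len(body.splitlines())
    if line_count < 500:
        return 5
    if line_count <= 600:
        return 3  # 5 - 2
    if line_count <= 800:
        return 1  # 5 - 4
    return 0
```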
Imperative form check: Scan for imperative verbs at the start of instruction sentences (Read, Run, Check, Use, Create, Write, Ensure, Avoid, etc.). If most instructions are passive ("The skill will...", "Claude should..."), flag it.
Why explanation check: Look for phrases like "because", "so that", "this helps", "the reason", "this ensures". A skill that only issues commands without rationale is harder for Claude to apply correctly in edge cases.
All-caps command word check: Scan for ALWAYS, NEVER, MUST, DO NOT, NEVER EVER used as standalone commands without explanation. From skill-creator: "If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag." A few instances are fine; a pattern of them suggests the skill is relying on brute-force commands instead of helping Claude understand the reasoning. Flag as ⚠️ if you find 3+ all-caps commands per 100 lines without accompanying rationale.
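The "3+ per 100 lines" threshold can be estimated with a short scan. This sketch assumes that a rationale counts only when one of the "why" phrases appears on the same line as the all-caps word, which is stricter than a careful reader would be; `all_caps_density` is a hypothetical name.

```python
import re

ALL_CAPS = re.compile(r"\b(?:ALWAYS|NEVER|MUST|DO NOT)\b")
RATIONALE = re.compile(
    r"\b(?:because|so that|this helps|the reason|this ensures)\b", re.IGNORECASE
)


def all_caps_density(body):
    """Return unexplained all-caps commands per 100 lines of body text."""
    lines = body.splitlines()
    if not lines:
        return 0.0
    unexplained = sum(
        len(ALL_CAPS.findall(line)) for line in lines if not RATIONALE.search(line)
    )
    return unexplained * 100 / len(lines)
```

A result of 3.0 or higher would be flagged as a warning under the rule above.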
Dimension 4: Progressive Disclosure Design (15 points)
The spec defines three loading tiers:
- Metadata (~100 words): name + description, always in context
- Instructions (<500 lines recommended): SKILL.md body, loaded on activation
- Resources (unlimited): scripts/, references/, assets/, loaded on demand
| Check | Points | Guidance |
|---|---|---|
| name + description together are concise (~100 words / ~500 characters) | 3 | If description alone is 150+ words, it's too heavy for metadata |
| SKILL.md body is under 500 lines | 4 | See body scoring above; this is a separate check for architecture |
| Large reference material is in references/ files, not inline in SKILL.md | 4 | If SKILL.md has 200+ line tables or reference docs, they should be in references/ |
| Reusable scripts are in scripts/, not copy-pasted inline in SKILL.md | 4 | If SKILL.md has 50+ line code blocks that could be scripts, flag it |
Note: A skill with no references/ or scripts/ can still score full points here if SKILL.md is appropriately sized. The question is whether the architecture fits the content.
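One way to find the "50+ line code blocks" mentioned in the table is to walk the fenced blocks in the SKILL.md body. The sketch below assumes fences are lines starting with three backticks and that blocks are not nested; `long_code_blocks` is a hypothetical helper, and deciding whether a long block "could be a script" remains a judgment call.

```python
FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def long_code_blocks(body, threshold=50):
    """Return (fence_line_number, content_line_count) for long fenced blocks."""
    found = []
    start = None
    for number, line in enumerate(body.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            if start is None:
                start = number  # opening fence
            else:
                length = number - start - 1  # lines between the two fences
                if length >= threshold:
                    found.append((start, length))
                start = None
    return found
```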
Dimension 5: Optional Directory Quality (10 points)
Only score directories that exist. If a directory doesn't exist, award full points for it (absence is fine).
scripts/ (3 points, if present)
| Check | Points |
|---|---|
| Scripts are self-contained or clearly document their dependencies | 1 |
| Scripts include helpful error messages or --help output | 1 |
| Scripts handle edge cases gracefully (not just happy path) | 1 |
references/ (4 points, if present)
| Check | Points |
|---|---|
| Each reference file is focused on a single topic | 2 |
| Individual reference files are under 300 lines | 1 |
| Files are clearly referenced from SKILL.md with guidance on when to read them | 1 |
assets/ (3 points, if present)
| Check | Points |
|---|---|
| Assets are appropriate static resources (templates, images, data files) | 2 |
| Assets are actually referenced or used by the skill | 1 |
Dimension 6: Description Trigger Optimization (10 points)
The description field is the primary mechanism that determines whether Claude invokes a skill. This dimension evaluates it specifically for triggering effectiveness.
Pre-check — "when to use" belongs in description, not body: Before scoring, scan the SKILL.md body for sections titled "When to Use", "Trigger Conditions", "When This Skill Applies", or similar. If you find a dedicated section in the body explaining when to trigger the skill, flag it as ⚠️ Warning. From skill-creator: "All 'when to use' info goes here [in description], not in the body." The body should focus on how to do the task, not when to invoke the skill. Suggest moving that content into the description field.
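The pre-check can be partly automated by scanning for Markdown headings with the listed titles. This sketch matches only those three titles literally; the "or similar" cases still need a manual read. `find_when_to_use_sections` is a hypothetical helper name.

```python
import re

WHEN_HEADING = re.compile(
    r"^#{1,6}\s+(?:when to use|trigger conditions|when this skill applies)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_when_to_use_sections(body):
    """Return (line_number, heading_text) for headings that describe triggering."""
    return [
        (body.count("\n", 0, match.start()) + 1, match.group(0).strip())
        for match in WHEN_HEADING.finditer(body)
    ]
```

Each hit would be reported as a warning, with a suggestion to move that content into the description field.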
| Check | Points | Guidance |
|---|---|---|
| Explicitly states when to use the skill (trigger conditions) | 3 | "Use when...", "Trigger when...", "TRIGGER when..." |
| Contains diverse trigger keywords covering different phrasings | 3 | Not just one way to ask, but multiple synonyms and contexts |
| Avoids being too broad (would trigger on unrelated tasks) | 2 | "Use for everything" is as bad as "Use for nothing" |
| Has appropriate "pushiness" to prevent undertriggering | 2 | The spec notes Claude tends to undertrigger; descriptions should be slightly assertive |
Pushiness check: Compare these two descriptions:
- Weak: "Helps with PDF files."
- Strong: "Extracts text, fills forms, and merges PDFs. Use whenever the user mentions PDFs, forms, document extraction, or needs to work with PDF content — even if they don't explicitly say 'PDF skill'."
The strong version is more likely to trigger correctly.
Scoring Summary
After scoring all dimensions, calculate the total and assign a grade:
| Score | Grade | Meaning |
|---|---|---|
| 90–100 | Excellent | Fully compliant, production-ready |
| 75–89 | Good | Minor improvements recommended |
| 60–74 | Acceptable | Needs improvement before publishing |
| 40–59 | Poor | Significant issues, rework required |
| 0–39 | Critical | Does not meet spec, major rewrite needed |
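The total and grade follow directly from the six dimension maxima and the grade table. In this sketch the dictionary keys are made-up labels for the six dimensions, and each score is clamped to its dimension's range as a defensive assumption.

```python
DIMENSION_MAX = {
    "directory_structure": 10,
    "frontmatter": 30,
    "body_content": 25,
    "progressive_disclosure": 15,
    "optional_directories": 10,
    "description_triggers": 10,
}
GRADES = [(90, "Excellent"), (75, "Good"), (60, "Acceptable"), (40, "Poor"), (0, "Critical")]


def total_and_grade(scores):
    """Sum the six dimension scores and map the total to a grade label."""
    missing = set(DIMENSION_MAX) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total = sum(
        min(max(scores[name], 0), limit) for name, limit in DIMENSION_MAX.items()
    )
    grade = next(label for floor, label in GRADES if total >= floor)
    return total, grade
```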
Report Format
Generate the report in this exact format. Fill in every section — do not skip sections even if there's nothing to report (write "None found." instead).
CODEBLOCK0
Approach and Mindset
The goal of this validation is to help skill authors improve their work, not to penalize them. Keep this in mind as you apply the scoring rubric.
Read everything before scoring. The reason to read every file — not just SKILL.md — is that a skill's quality often shows up in its scripts and reference files. A SKILL.md that looks thin might be appropriately thin because the complexity lives in well-organized references/. Conversely, a long SKILL.md might be hiding content that should have been split out. You can't judge the architecture without seeing all the pieces.
Quote specific evidence. Vague feedback like "the description could be better" doesn't help anyone. When you write "The description is 347 characters, includes 'when to use' guidance, and contains trigger keywords: 'PDF', 'forms', 'document extraction'" — that's actionable. The author knows exactly what they did right and can apply the same pattern elsewhere.
For every failure, show what right looks like. Don't just identify the problem — provide a concrete example of the fix. This is the difference between a report that gets filed away and one that gets acted on.
Handle edge cases gracefully:
- If the skill is a single SKILL.md file with no directory: note this, score what you can, and explain what a full directory structure would add
- If the skill is a complex multi-directory repository: check each skill subdirectory separately if there are multiple
- If you can't access a file: note it as "unable to read" and explain why, then score conservatively
- If the frontmatter is malformed YAML: flag it as a critical failure and attempt to parse what you can
Composite skills: Some repositories contain multiple skills (e.g., skills/pdf/, skills/docx/). Validate each skill separately and provide a combined summary at the end — each skill stands on its own.