ClawCheck
Two-phase audit: a fast deterministic scan catches structural issues, then you (the agent) do a deep quality evaluation on the flagged areas.
When to Use
- After initial setup or major config changes
- Before publishing skills to ClawHub (quality gate)
- Periodic health check (weekly cron or manual)
- When something feels off but `openclaw doctor` says "ok"
- After installing new skills or updating OpenClaw
What This Checks vs Built-in
| This skill | `openclaw doctor` (built-in) |
|---|---|
| Secrets exposure + token hygiene | Config JSON schema validation |
| Cron ops health + prompt quality review | Plugin/skill eligibility |
| Config optimization + value assessment | Channel connectivity |
| Skill structural + content quality audit | State migrations, browser detection |
How It Works: Two Phases
Phase 1: Deterministic Scan (fast, free)
Run the script to get a structural baseline:
```bash
python3 {baseDir}/scripts/audit.py
```
Individual modules:
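```bash
python3 {baseDir}/scripts/audit.py --security
python3 {baseDir}/scripts/audit.py --cron
python3 {baseDir}/scripts/audit.py --config
python3 {baseDir}/scripts/audit.py --skills
```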
This produces JSON with scores, findings, and the bottom/top skill lists. Use this as your triage map for Phase 2.
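As a rough illustration of using the Phase 1 JSON as a triage map, the sketch below lists the sections that fall under a cutoff, weakest first. The `sections` and `score` keys follow the sample script output; the cutoff of 70 and the function name `triage` are assumptions made for this example, not part of the script.

```python
import json

def triage(report_json: str, cutoff: int = 70) -> list[str]:
    """Return section names scoring below `cutoff`, weakest first."""
    report = json.loads(report_json)
    sections = report.get("sections", {})
    weak = sorted(
        (data.get("score", 0), name)
        for name, data in sections.items()
        if data.get("score", 0) < cutoff
    )
    return [name for _, name in weak]

# Sample values mirror the example Phase 1 output.
sample = json.dumps({
    "score": 82,
    "sections": {
        "security": {"score": 65, "finding_count": 3},
        "cron": {"score": 95, "finding_count": 1},
        "config": {"score": 88, "finding_count": 2},
        "skills": {"score": 80, "finding_count": 1},
    },
})
print(triage(sample))  # ['security']
```

Raising the cutoff widens the Phase 2 scope, which is one way to map "quick check" versus "full audit" onto the same data.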
Phase 2: Deep Quality Audit (you, the agent)
After running the script, perform these evaluations. Budget your depth based on what the user asked for ("quick check" = Phase 1 only, "full audit" or "quality review" = both phases).
2a. Config Quality Review
Read ~/.openclaw/openclaw.json and evaluate:
- Heartbeat prompt: Read `agents.defaults.heartbeat.prompt`. Is it specific enough to catch real issues? Does it avoid heavy operations? A good heartbeat prompt is under 200 words, checks 2-3 things, and has clear escalation criteria.
- Model choices: Is the primary model appropriate for the workload? Are fallbacks a meaningful step-down (not the same tier)? Is the subagent model cheaper than the primary?
- Compaction thresholds: Are `reserveTokens` and `keepRecentTokens` reasonable for the context window size? Rule of thumb: reserve should be 15-20% of `contextTokens`.
- Session maintenance: Are `pruneAfter`, `maxEntries`, and `rotateBytes` set to values that match the usage pattern? Heavy cron usage needs more aggressive pruning.
- Cron `maxConcurrentRuns`: Is it high enough for the number of frequent jobs? Count jobs with `*/` in their schedule expression.
Score each aspect 1-5. Report specific improvements.
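The compaction rule of thumb is simple arithmetic, so it can be checked before scoring by hand. A minimal sketch, assuming the two token values have already been read from the config; the function name and the inclusive 15-20% band are choices made for this example.

```python
def reserve_ratio_ok(reserve_tokens: int, context_tokens: int,
                     low: float = 0.15, high: float = 0.20) -> tuple[float, bool]:
    """Return (ratio, within_band) for the 15-20% rule of thumb."""
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    ratio = reserve_tokens / context_tokens
    return ratio, low <= ratio <= high

# Figures from the sample report: reserveTokens=150k in an 800k context.
ratio, ok = reserve_ratio_ok(150_000, 800_000)
print(ratio, ok)  # 0.1875 True
```

A ratio outside the band is a prompt for a closer look rather than an automatic low score, since the right reserve also depends on how long typical replies are.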
2b. Cron Prompt Quality Review
Read ~/.openclaw/cron/jobs.json. Select the 5 most important enabled jobs using this heuristic:
1. Any job in error state (from Phase 1 findings)
2. Jobs with the highest `frequency x timeoutSeconds` (most resource-consuming)
3. Jobs running on expensive models (opus/primary)
4. If still under 5, pick by business impact (backups, monitoring, user-facing)
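The first three steps of this heuristic can be approximated with a sort key. This is a sketch under stated assumptions: the `state`, `runs_per_day`, `model`, `enabled`, and `name` fields are hypothetical stand-ins for whatever `jobs.json` actually stores, and frequency is assumed to be precomputed as runs per day. Business impact (step 4) stays a judgment call and is not modeled.

```python
EXPENSIVE_MODELS = ("opus",)  # assumption: matched as a substring of the model name

def select_jobs(jobs: list[dict], limit: int = 5) -> list[str]:
    """Return up to `limit` enabled job names, most audit-worthy first."""
    def sort_key(job: dict) -> tuple:
        in_error = job.get("state") == "error"  # hypothetical field
        timeout = job.get("payload", {}).get("timeoutSeconds", 0)
        cost = job.get("runs_per_day", 0) * timeout  # hypothetical field
        expensive = any(m in job.get("model", "") for m in EXPENSIVE_MODELS)
        # False sorts before True, so negate flags to put matches first.
        return (not in_error, -cost, not expensive)

    enabled = [job for job in jobs if job.get("enabled", True)]
    return [job["name"] for job in sorted(enabled, key=sort_key)[:limit]]

jobs = [
    {"name": "backup", "runs_per_day": 1, "payload": {"timeoutSeconds": 600}, "model": "sonnet"},
    {"name": "scanner", "runs_per_day": 96, "payload": {"timeoutSeconds": 120}, "model": "opus", "state": "error"},
    {"name": "briefing", "runs_per_day": 1, "payload": {"timeoutSeconds": 300}, "model": "opus"},
    {"name": "poller", "runs_per_day": 288, "payload": {"timeoutSeconds": 60}, "model": "sonnet"},
    {"name": "retired", "enabled": False},
]
print(select_jobs(jobs, limit=3))  # ['scanner', 'poller', 'backup']
```

Note the simplification: after error jobs, the ordering is driven by cost, and the model tier only breaks ties. A stricter reading of the heuristic would fill the list tier by tier.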
For each selected job evaluate:
- Prompt clarity: Specific enough to execute without guessing? Clear steps, expected output format, error handling?
- Safety: Has guardrails? ("NEVER run git push", "read-only", "do not edit files directly")
- Efficiency: Token-efficient? Flag prompts > 1500 chars that run on expensive models. Could the prompt reference a skill file instead of inlining instructions?
- Output value: Produces actionable output or just noise?
- Timeout: Is `payload.timeoutSeconds` set and reasonable for the scope?
Score each job 1-5 on: purpose, prompt quality, safety, efficiency. Flag jobs scoring below 3.
Cross-reference: Check if any cron prompts reference skills that scored below 70 in Phase 1. A cron job is only as reliable as the skills it depends on.
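One possible way to do this cross-reference is a plain substring match of skill names against job prompts. The sketch assumes the prompt lives at `payload.prompt` and that Phase 1 skill scores are available as a name-to-score mapping; both shapes are assumptions, and the `weather` skill and both jobs are invented for the example.

```python
def fragile_jobs(jobs: list[dict], skill_scores: dict[str, int],
                 threshold: int = 70) -> dict[str, list[str]]:
    """Map job name -> low-scoring skills whose names appear in its prompt."""
    weak = [name for name, score in skill_scores.items() if score < threshold]
    hits: dict[str, list[str]] = {}
    for job in jobs:
        prompt = job.get("payload", {}).get("prompt", "")  # assumed location
        used = sorted(name for name in weak if name in prompt)
        if used:
            hits[job["name"]] = used
    return hits

jobs = [
    {"name": "notes-digest",
     "payload": {"prompt": "Use the apple-notes skill to summarize new notes."}},
    {"name": "backup",
     "payload": {"prompt": "Archive the workspace. Read-only."}},
]
skill_scores = {"apple-notes": 62, "blucli": 62, "weather": 91}
print(fragile_jobs(jobs, skill_scores))  # {'notes-digest': ['apple-notes']}
```

Substring matching will miss skills referenced by path or alias, so treat an empty result as "nothing obvious" rather than proof of independence.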
2c. Skill Content Quality Review
From the Phase 1 results, pick:
- The 3 lowest-scoring skills (from `bottom_5`)
- Any skills the user specifically asks about
- Skills used by failing cron jobs (cross-reference cron findings)
For each selected skill, read its full SKILL.md and evaluate:
- Accuracy (2x weight): Would following these instructions produce correct behavior? Are API references current? Are file paths real?
- Completeness (1.5x): Are all use cases covered? Edge cases? What happens when dependencies are missing?
- Clarity (1x): Can an agent follow this without ambiguity? No hedging, clear steps, good examples?
- Efficiency (1x): Is the SKILL.md bloated? Could it be shorter without losing information? Does it suggest efficient patterns (batching, caching)?
- Voice alignment (1x, content-producing skills only): Does the output match the brand/user's tone?
Scoring formula depends on skill type:
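- Content/marketing skills (has voice component): `(accuracy*2 + completeness*1.5 + clarity + efficiency + voice) / 6.5`
- Utility/tool skills (no voice): `(accuracy*2 + completeness*1.5 + clarity + efficiency) / 5.5`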
For skills scoring below 4.0, write specific improvement recommendations with concrete examples.
2d. Security Assessment
Phase 1 now scans workspace files for common secret patterns (sk-, ghp_, AIzaSy, Bearer tokens, hex private keys, etc.). In Phase 2, go deeper:
- Review any secrets the script found in workspace files. Are they real credentials or false positives (e.g., example/placeholder values)?
- Check if any skill `scripts/` contain hardcoded credentials or API URLs with embedded tokens
- Check if `.env` files exist inside skill directories
- Look for credentials in cron job prompts (some prompts inline API keys instead of referencing env vars)
- Check if any workspace knowledge files contain customer data, passwords, or access tokens
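To make the real-versus-placeholder question concrete, here is a sketch of a line-by-line scan that also marks likely placeholders. The regular expressions and the placeholder hints are illustrative guesses built around the prefixes named above; they are not the patterns the Phase 1 script actually uses.

```python
import re

# Illustrative regexes only; the real script's patterns may differ.
PATTERNS = {
    "sk-": re.compile(r"sk-[A-Za-z0-9]{20,}"),
    "ghp_": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "AIzaSy": re.compile(r"AIzaSy[A-Za-z0-9_-]{33}"),
    "Bearer": re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]{20,}"),
}
PLACEHOLDER_HINTS = ("example", "placeholder", "your_", "xxxx")  # assumption

def scan_text(text: str) -> list[tuple[int, str, bool]]:
    """Return (line_number, pattern_label, looks_like_placeholder) per match."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        placeholder = any(hint in lowered for hint in PLACEHOLDER_HINTS)
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, label, placeholder))
    return findings

# Synthetic strings, built at runtime so no real-looking token sits in the file.
sample = "\n".join([
    "token = ghp_" + "a" * 36,
    "# example: Authorization: Bearer " + "x" * 24,
    "nothing to see here",
])
for finding in scan_text(sample):
    print(finding)
```

The placeholder flag is only a hint for the reviewer; a value marked as a placeholder still deserves a glance, since real keys do end up in files named "example".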
Output Format
Phase 1 (script output)
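```json
{
  "score": 82,
  "score_type": "structural_hygiene",
  "status": "healthy",
  "sections": {
    "security": {"score": 65, "finding_count": 3},
    "cron": {"score": 95, "finding_count": 1},
    "config": {"score": 88, "finding_count": 2},
    "skills": {"score": 80, "finding_count": 1}
  },
  "findings": [...]
}
```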
Phase 2 (your evaluation)
Present as a readable report to the user:
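```
ClawCheck Report
Structural Baseline (Phase 1)
Overall: 82/100 (healthy)
Security: 65 | Cron: 95 | Config: 88 | Skills: 80
Deep Quality Findings (Phase 2)
Config:
- Heartbeat prompt: 4/5 (clear, but could add a Telegram alert on critical)
- Model choices: 5/5 (opus primary, sonnet fallback, sonnet subagent)
- Compaction: 4/5 (reserveTokens=150k in 800k context = 19%, good)
Cron (top concerns):
- Morning briefing (3/5): prompt is 400 words but lacks an output format spec
- Frontier scanner (2/5): no safety guardrails, no error handling
Skills (bottom 3):
- marketing-automation: BROKEN (no SKILL.md)
- apple-notes (structure 62/100): [content assessment]
- blucli (structure 62/100): [content assessment]
Recommended Actions (in priority order)
1. [most impactful fix]
2. [next fix]
3. [next fix]
```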
Scoring Weights (Phase 1 script)
Security 30%, cron 25%, config 20%, skills 25%.
Skill structure formula: `(structure*2 + completeness*1.5 + clarity + efficiency) / 5.5 * 20`
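Read literally, the section weights and the skill structure formula amount to the arithmetic below. This is a sketch of that reading, not the script's code: for the sample section scores it gives roughly 80.85, while the sample output reports 82, so the script presumably applies rounding or adjustments that are not modeled here.

```python
WEIGHTS = {"security": 0.30, "cron": 0.25, "config": 0.20, "skills": 0.25}

def overall_score(section_scores: dict[str, float]) -> float:
    """Weighted sum of the four section scores (each 0-100)."""
    return sum(section_scores[name] * weight for name, weight in WEIGHTS.items())

def structure_score(structure: float, completeness: float,
                    clarity: float, efficiency: float) -> float:
    """Scale weighted 1-5 sub-scores onto a 0-100 range."""
    return (structure * 2 + completeness * 1.5 + clarity + efficiency) / 5.5 * 20

sample = {"security": 65, "cron": 95, "config": 88, "skills": 80}
print(round(overall_score(sample), 2))
print(structure_score(5, 5, 5, 5))  # 100.0
```

Because security carries the largest weight, a low security score drags the overall number down more than an equally low score in any other section.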
Remediation
For detailed fix patterns with real config examples, see {baseDir}/references/remediation.md.
Quick fixes for common findings:
Inline secrets
```json
"GAMMA_API_KEY": {"source": "exec", "provider": "op-gamma", "id": "value"}
```
Plaintext bot token
CODEBLOCK5
Missing heartbeat
CODEBLOCK6
Missing timezone on cron
CODEBLOCK7
Error Handling
- If OpenClaw dir not found: script exits with error JSON and exit code 1.
- If `openclaw.json` is missing or invalid: script exits with error JSON.
- If individual module fails: caught and reported as warning, other modules still run.
- If bundled skills dir not accessible: skipped silently.
- Phase 2 failures: if you can't read a file, note it and move on. Don't stop the whole audit.
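The "one module fails, the rest still run" behavior is a common isolation pattern. A minimal sketch of it follows, with hypothetical module callables standing in for the real ones; it illustrates the pattern and makes no claim about how `audit.py` is actually structured.

```python
def run_modules(modules: dict) -> dict:
    """Run each audit module; a failure becomes a warning, not a crash."""
    results, warnings = {}, []
    for name, func in modules.items():
        try:
            results[name] = func()
        except Exception as exc:  # deliberately broad: isolate any module failure
            warnings.append(f"{name} module failed: {exc}")
    return {"results": results, "warnings": warnings}

def broken_cron_module():
    raise FileNotFoundError("jobs.json missing")

# Hypothetical modules for the example.
modules = {
    "security": lambda: {"score": 65},
    "cron": broken_cron_module,
    "config": lambda: {"score": 88},
}
outcome = run_modules(modules)
print(sorted(outcome["results"]))  # ['config', 'security']
print(outcome["warnings"])         # ['cron module failed: jobs.json missing']
```

The same approach applies in Phase 2: record the unreadable file as a note in the report and continue with the next item.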
Non-Goals
- No direct edits to config or skills (report only, user decides)
- No network calls (everything is local file inspection)
- No overlap with `openclaw doctor` schema validation or channel connectivity checks
ClawCheck
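Two-phase audit: a fast deterministic scan catches structural issues, then you (the agent) do a deep quality evaluation of the flagged areas.
When to Use
- After initial setup or major config changes
- Before publishing skills to ClawHub (quality gate)
- Periodic health check (weekly cron or manual)
- When something feels off but `openclaw doctor` reports everything as normal
- After installing new skills or updating OpenClaw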
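What This Skill Checks vs the Built-in Tool
| This skill | `openclaw doctor` (built-in) |
|---|---|
| Secrets exposure + token hygiene | Config JSON schema validation |
| Cron ops health + prompt quality review | Plugin/skill eligibility |
| Config optimization + value assessment | Channel connectivity |
| Skill structural + content quality audit | State migrations, browser detection |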
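How It Works: Two Phases
Phase 1: Deterministic Scan (fast, free)
Run the script to get a structural baseline:
```bash
python3 {baseDir}/scripts/audit.py
```
Individual modules:
```bash
python3 {baseDir}/scripts/audit.py --security
python3 {baseDir}/scripts/audit.py --cron
python3 {baseDir}/scripts/audit.py --config
python3 {baseDir}/scripts/audit.py --skills
```
This produces JSON containing scores, findings, and the bottom/top skill lists. Use it as the triage map for Phase 2.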
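Phase 2: Deep Quality Audit (you, the agent)
After running the script, perform the following evaluations. Adjust depth to what the user asked for (quick check = Phase 1 only, full audit or quality review = both phases).
2a. Config Quality Review
Read ~/.openclaw/openclaw.json and evaluate:
- Heartbeat prompt: Read `agents.defaults.heartbeat.prompt`. Is it specific enough to catch real issues? Does it avoid heavy operations? A good heartbeat prompt is under 200 words, checks 2-3 things, and has clear escalation criteria.
- Model choices: Is the primary model appropriate for the workload? Are fallbacks a meaningful step-down (not the same tier)? Is the subagent model cheaper than the primary?
- Compaction thresholds: Are `reserveTokens` and `keepRecentTokens` reasonable for the context window size? Rule of thumb: reserve should be 15-20% of `contextTokens`.
- Session maintenance: Are `pruneAfter`, `maxEntries`, and `rotateBytes` set to values that match the usage pattern? Heavy cron usage needs more aggressive pruning.
- Cron `maxConcurrentRuns`: Is it high enough for the number of frequent jobs? Count jobs with `*/` in their schedule expression.
Score each aspect 1-5. Report specific improvement suggestions.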
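Counting frequent jobs for the `maxConcurrentRuns` check is a one-line filter. The sketch below assumes each job carries a `schedule` string and an optional `enabled` flag, which are hypothetical field names. Comparing the count directly against the limit is a conservative simplification, since frequent jobs do not necessarily overlap.

```python
def frequent_jobs(jobs: list[dict]) -> list[str]:
    """Names of enabled jobs whose schedule expression contains '*/'."""
    return [job["name"] for job in jobs
            if job.get("enabled", True) and "*/" in job.get("schedule", "")]

def concurrency_note(jobs: list[dict], max_concurrent_runs: int) -> str:
    """Summarize whether the frequent-job count exceeds the configured limit."""
    count = len(frequent_jobs(jobs))
    if count > max_concurrent_runs:
        return f"{count} frequent jobs exceed maxConcurrentRuns={max_concurrent_runs}"
    return f"{count} frequent jobs fit within maxConcurrentRuns={max_concurrent_runs}"

jobs = [
    {"name": "poller", "schedule": "*/5 * * * *"},
    {"name": "sync", "schedule": "0 */2 * * *"},
    {"name": "nightly", "schedule": "0 3 * * *"},
    {"name": "paused", "schedule": "*/1 * * * *", "enabled": False},
]
print(frequent_jobs(jobs))        # ['poller', 'sync']
print(concurrency_note(jobs, 1))  # 2 frequent jobs exceed maxConcurrentRuns=1
```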
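2b. Cron Prompt Quality Review
Read ~/.openclaw/cron/jobs.json. Select the 5 most important enabled jobs using this heuristic:
1. Any job in error state (from Phase 1 findings)
2. Jobs with the highest `frequency x timeoutSeconds` (most resource-consuming)
3. Jobs running on expensive models (opus/primary)
4. If still under 5, pick by business impact (backups, monitoring, user-facing)
For each selected job, evaluate:
- Prompt clarity: Specific enough to execute without guessing? Clear steps, expected output format, error handling?
- Safety: Has guardrails? (never run git push, read-only, do not edit files directly)
- Efficiency: Token-efficient? Flag prompts over 1500 characters that run on expensive models. Could the prompt reference a skill file instead of inlining instructions?
- Output value: Produces actionable output or just noise?
- Timeout: Is `payload.timeoutSeconds` set and reasonable for the scope?
Score each job 1-5 on: purpose, prompt quality, safety, efficiency. Flag jobs scoring below 3.
Cross-reference: Check whether any cron prompts reference skills that scored below 70 in Phase 1. A cron job is only as reliable as the skills it depends on.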
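The efficiency and timeout criteria in this review are mechanical enough to pre-screen before scoring by hand. A sketch, assuming the prompt sits at `payload.prompt` and the model name in a top-level `model` field; both locations are assumptions, and only `payload.timeoutSeconds` comes from the text above.

```python
EXPENSIVE_MODELS = ("opus",)  # assumption: matched as a substring of the model name
MAX_PROMPT_CHARS = 1500

def efficiency_flags(job: dict) -> list[str]:
    """Return human-readable flags for one cron job."""
    payload = job.get("payload", {})
    prompt = payload.get("prompt", "")  # assumed location
    flags = []
    on_expensive = any(m in job.get("model", "") for m in EXPENSIVE_MODELS)
    if on_expensive and len(prompt) > MAX_PROMPT_CHARS:
        flags.append(f"prompt is {len(prompt)} chars on an expensive model")
    if "timeoutSeconds" not in payload:
        flags.append("payload.timeoutSeconds is not set")
    return flags

job = {"name": "briefing", "model": "opus", "payload": {"prompt": "x" * 1600}}
print(efficiency_flags(job))
# ['prompt is 1600 chars on an expensive model', 'payload.timeoutSeconds is not set']
```

Flags like these feed the efficiency score; whether a long prompt is justified still needs a human-style read of what it says.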
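2c. Skill Content Quality Review
From the Phase 1 results, pick:
- The 3 lowest-scoring skills (from `bottom_5`)
- Any skills the user specifically asks about
- Skills used by failing cron jobs (cross-reference cron findings)
For each selected skill, read its full SKILL.md and evaluate:
- Accuracy (2x weight): Would following these instructions produce correct behavior? Are API references current? Are file paths real?
- Completeness (1.5x weight): Are all use cases covered? Edge cases? What happens when dependencies are missing?
- Clarity (1x weight): Can an agent follow this without ambiguity? No hedging, clear steps, good examples?
- Efficiency (1x weight): Is the SKILL.md bloated? Could it be shorter without losing information? Does it suggest efficient patterns (batching, caching)?
- Voice alignment (1x weight, content-producing skills only): Does the output match the brand/user's tone?
The scoring formula depends on skill type:
- Content/marketing skills (has voice component): `(accuracy*2 + completeness*1.5 + clarity + efficiency + voice) / 6.5`
- Utility/tool skills (no voice): `(accuracy*2 + completeness*1.5 + clarity + efficiency) / 5.5`
For skills scoring below 4.0, write specific improvement recommendations with concrete examples.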
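2d. Security Assessment
Phase 1 now scans workspace files for common secret patterns (sk-, ghp_, AIzaSy, Bearer tokens, hex private keys, etc.). In Phase 2, go deeper:
- Review any secrets the script found in workspace files. Are they real credentials or false positives (e.g., example/placeholder values)?
- Check if any skill `scripts/` contain hardcoded credentials or API URLs with embedded tokens
- Check if `.env` files exist inside skill directories
- Look for credentials in cron job prompts (some prompts inline API keys instead of referencing env vars)
- Check if any workspace knowledge files contain customer data, passwords, or access tokens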
Output Format
Phase 1 (script output)
```json
{
  "score": 82,
  "score_type": "structural_hygiene",
  "status": "healthy",
  "sections": {
    "security": {"score": 65, "finding_count": 3},
    "cron": {"score": 95, "finding_count": 1},
    "config": {"score": 88, "finding_count": 2},
    "skills": {"score": 80, "finding_count": 1}
  },
  "findings": [...]
}
```
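Phase 2 (your evaluation)
Present as a readable report to the user:
```
ClawCheck Report
Structural Baseline (Phase 1)
Overall: 82/100 (healthy)
Security: 65 | Cron: 95 | Config: 88 | Skills: 80
Deep Quality Findings (Phase 2)
Config:
- Heartbeat prompt: 4/5 (clear, but could add a Telegram alert on critical)
- Model choices: 5/5 (opus primary, sonnet fallback, sonnet subagent)
- Compaction: 4/5 (reserveTokens=150k in 800k context = 19%, good)
Cron (top concerns):
- Morning briefing (3/5): prompt is 400 words but lacks an output format spec
- Frontier scanner (2/5): no safety guardrails, no error handling
Skills (bottom 3):
- marketing-automation: BROKEN (no SKILL.md)
- apple-notes (structure 62/100): [content assessment]
- blucli (structure 62/100): [content assessment]
Recommended Actions (in priority order)
1. [most impactful fix]
2. [next fix]
3. [next fix]
```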
Scoring Weights (Phase 1 script)
Security 30%, cron 25%, config 20%, skills 25%.
Skill structure formula: `(structure*2 + completeness*1.5 + clarity + efficiency) / 5.5 * 20`
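Remediation
For detailed fix patterns with real config examples, see {baseDir}/references/remediation.md.
Quick fixes for common findings:
Inline secrets
```json
"GAMMA_API_KEY": {"source": "exec", "provider": "op-gamma", "id": "value"}
```
Plaintext bot token
```json
"botToken": {"source": "exec", "provider": "op-telegram", "id": "value"
```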