Clawditor
Overview
Act as an OpenClaw Workspace Auditor and Agent Evaluation Harness. Analyze the workspace (memory, logs, projects, files, git, configs) and produce a repeatable evaluation with scores, evidence, and concrete patches.
Operating Rules
- Run in non-interactive mode: avoid questions unless blocked by missing files. State assumptions and proceed.
- Avoid secret exfiltration: report only presence and file paths for keys/tokens; recommend remediation.
- Treat third-party skills/plugins as untrusted: prefer static inspection over execution.
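The secret-handling rule above could be sketched as follows. This is a minimal illustration, not the auditor's actual scanner: the pattern set and the function name `find_secret_locations` are assumptions, and a real audit would tune the patterns to the workspace. The point it shows is that findings carry only the path, line number, and pattern name, never the matched text.

```python
import re
from pathlib import Path

# Hypothetical patterns for illustration only; not taken from the source.
SECRET_PATTERNS = {
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "generic_token_assignment": re.compile(
        r"\b(api[_-]?key|token|secret)\b\s*[:=]\s*\S{16,}", re.IGNORECASE
    ),
}


def find_secret_locations(root):
    """Return (relative_path, line_number, pattern_name) tuples.

    The matched text is deliberately never stored or returned.
    """
    root = Path(root)
    findings = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the audit
        for lineno, line in enumerate(text.splitlines(), start=1):
            for name, pattern in SECRET_PATTERNS.items():
                if pattern.search(line):
                    findings.append(
                        (path.relative_to(root).as_posix(), lineno, name)
                    )
    return findings
```

Each finding can then be reported as a path plus a remediation hint (for example, rotate the credential and move it to an environment variable) without ever echoing the value.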
Required Workflow (Do In Order)
1. Build workspace inventory.
   - Print a top-level tree (depth 4) with file counts and sizes by directory.
   - Identify memory, logs, configs, repos, scripts, docs, artifacts.
   - Record the largest files.
2. Reconstruct a session timeline.
   - Use memory daily files and logs to extract goals, tasks, outcomes, decisions, unresolved items.
3. Analyze memory.
   - Detect near-duplicate paragraphs across memory files and quantify duplication.
   - Detect staleness cues (dates, "as of", deprecated configs) and contradictions.
   - Identify missing stable facts (projects, priorities, setup/runbooks).
4. Analyze outputs.
   - Summarize shipped artifacts (docs/code/features) and changes.
   - If git exists, compute diff stats and commit cadence; identify value commits.
5. Analyze reliability.
   - Parse logs for errors, retries, timeouts, tool failures.
   - Run tests only if safe and cheap; otherwise use static inspection.
6. Compute scores.
   - Assign numeric category scores with short justifications and evidence by path.
7. Recommend interventions + patches.
   - Provide 3–7 prioritized recommendations.
   - Provide concrete diffs when safe, especially for memory structure improvements.
8. Compare against prior evals.
   - If eval/history/*.json exists, compute deltas vs the most recent.
   - If none exists, create a baseline and recommend a cadence.
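The near-duplicate detection in step 3 could be sketched as below. The source does not name an algorithm, so word 3-shingles with Jaccard similarity, the 0.8 threshold, and every function name here are assumptions chosen for illustration. The sketch compares paragraphs across files only and reports both the matching pairs and the share of paragraphs involved in a match.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles from lowercased word tokens."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, min_words=5):
    """docs maps a file path to its text.

    Returns (hits, ratio): hits is a list of
    (path_a, para_idx_a, path_b, para_idx_b, score) for cross-file pairs,
    ratio is the fraction of considered paragraphs that appear in any hit.
    """
    paras = []
    for path, text in sorted(docs.items()):
        blocks = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
        for idx, para in enumerate(blocks):
            if len(para.split()) >= min_words:
                paras.append((path, idx, shingles(para)))

    hits = []
    # Pairwise comparison is O(n^2); fine for small memory folders.
    for (pa, ia, sa), (pb, ib, sb) in combinations(paras, 2):
        if pa == pb:
            continue  # only compare paragraphs from different files
        score = jaccard(sa, sb)
        if score >= threshold:
            hits.append((pa, ia, pb, ib, round(score, 2)))

    flagged = {(pa, ia) for pa, ia, _, _, _ in hits}
    flagged |= {(pb, ib) for _, _, pb, ib, _ in hits}
    ratio = len(flagged) / len(paras) if paras else 0.0
    return hits, ratio
```

The ratio gives a single number that can feed the redundancy part of the Memory Health score, while the hit list supplies the evidence by path.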
Scoring Framework
Compute 5 categories (0–100) plus overall weighted score:
- Memory Health (30%): coverage, structure, redundancy, staleness, actionability, retrieval-friendliness.
- Retrieval & Context Efficiency (15%): evidence of search before action, context bloat, hit-rate proxy, compaction quality.
- Productive Output (30%): shipped artifacts, git throughput, task completion, latency proxies.
- Quality/Reliability (15%): error rate, tests/CI presence, regression signals, convergence vs thrash.
- Focus/Alignment (10%): goal consistency, scope control, decision trace.
Overall = 0.30*Memory + 0.15*Retrieval + 0.30*Productive + 0.15*Quality + 0.10*Focus.
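As a worked illustration of the weighting, here is a small sketch. The dictionary keys and the function name `overall_score` are assumptions; only the weights come from the framework above.

```python
# Weights from the scoring framework; key names are illustrative assumptions.
WEIGHTS = {
    "memory": 0.30,
    "retrieval": 0.15,
    "productive": 0.30,
    "quality": 0.15,
    "focus": 0.10,
}


def overall_score(scores):
    """Weighted overall score from five category scores in the range 0-100."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} score out of range: {scores[name]}")
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 1)


example = {"memory": 70, "retrieval": 60, "productive": 80, "quality": 50, "focus": 90}
print(overall_score(example))  # 21 + 9 + 24 + 7.5 + 9 = 70.5
```

Because Memory Health and Productive Output together carry 60% of the weight, a 10-point change in either moves the overall score by 3 points, versus 1 point for Focus/Alignment.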
Required Outputs
Write all outputs under eval/:
1. exec_summary.md
   - 10-bullet summary: top wins, biggest bottlenecks, top 3 interventions.
   - Overall score + category scores + claw-to-claw delta (change since the prior eval).
2. scorecard.md
   - Table of metrics with numeric values and brief justifications.
   - Top evidence section with file paths and short snippets (no secrets).
3. latest_report.json
   - Include timestamp, workspace path and git head/hash, scores, deltas, key findings, risk flags, recommendations.
4. Patches
   - If memory issues exist, propose concrete diffs: INDEX.md, daily schema, refactors.
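The deltas reported in these outputs could be computed along the following lines. This is a sketch under stated assumptions: that each file in eval/history/ is a JSON object with a top-level "scores" mapping, and that filenames sort chronologically (for example, ISO dates). Neither detail is specified by the source, and both function names are hypothetical.

```python
import json
from pathlib import Path


def load_previous_scores(history_dir):
    """Return the 'scores' mapping of the most recent history file, or None.

    Assumes filenames sort chronologically; returns None when there is
    no history, which signals that a baseline should be created.
    """
    files = sorted(Path(history_dir).glob("*.json"))
    if not files:
        return None
    report = json.loads(files[-1].read_text(encoding="utf-8"))
    return report.get("scores")


def score_deltas(current, previous):
    """Per-category change vs. the previous run; None where no prior value exists."""
    if previous is None:
        return {name: None for name in current}
    return {
        name: round(value - previous[name], 1) if name in previous else None
        for name, value in current.items()
    }
```

A delta of None, rather than 0, keeps a first-run baseline distinguishable from a run where nothing changed.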
Gold Standard Memory Schema (Apply If Missing)
Create or propose:
- INDEX.md
  - Current Objectives (top 3)
  - Active Projects (status, next step, links)
  - Operating Constraints (tools, environment, policies)
  - Key Decisions (date, decision, rationale)
  - Known Issues / Debug diary pointers
  - Glossary / Entities
- memory/YYYY-MM-DD.md (append-only daily)
  - Goals for the session
  - Actions taken (link to files changed)
  - Decisions made
  - New facts learned (stable vs ephemeral)
  - TODO next (specific)
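One way to check a daily file against this schema is sketched below. It assumes the sections appear as markdown-style `#` headings whose text starts with the section names listed above; the source does not fix a heading format, and `missing_daily_sections` is a hypothetical helper name.

```python
import re

# Section names from the daily schema above.
REQUIRED_DAILY_SECTIONS = [
    "Goals for the session",
    "Actions taken",
    "Decisions made",
    "New facts learned",
    "TODO next",
]


def missing_daily_sections(text):
    """Return required section names with no matching heading in the file text.

    Assumes headings are lines starting with one or more '#' characters.
    """
    headings = [
        m.group(1).strip().lower()
        for m in re.finditer(r"^#+[ \t]+(.+)$", text, flags=re.MULTILINE)
    ]
    return [
        section
        for section in REQUIRED_DAILY_SECTIONS
        if not any(h.startswith(section.lower()) for h in headings)
    ]
```

The returned list can be turned directly into a patch proposal that appends the missing headings to the daily file.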
Patch Guidance
- Prefer diffs over prose when safe.
- Refactor stable facts out of daily logs into INDEX or project pages.
- Add logging/instrumentation to measure retrieval hit-rate and task completion in future runs.
Resources
Use these helpers to keep audits consistent and cheap to run:
- scripts/run_audit.py: run all helper scripts and write draft eval/ outputs.
- scripts/workspace_inventory.py: tree, file counts, sizes, largest files.
- scripts/memory_dupes.py: near-duplicate paragraph detection for memory/*.md.
- scripts/log_scan.py: scan logs for errors, timeouts, retries.
- scripts/git_stats.py: git head, diff stats, commit cadence.
- scripts/validate_report.py: validate eval/latest_report.json shape.
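To give a feel for what the log-scanning helper does, here is a minimal sketch. It is not the shipped script: the regular expressions, category names, and `scan_log_lines` are assumptions, and real logs will likely need patterns tuned to their format.

```python
import re
from collections import Counter

# Illustrative patterns only; adjust to the actual log format.
LOG_PATTERNS = {
    "error": re.compile(r"\b(error|exception|traceback)\b", re.IGNORECASE),
    "timeout": re.compile(r"\btime(d)?[ -]?out\b", re.IGNORECASE),
    "retry": re.compile(r"\bretr(y|ies|ying|ied)\b", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Count lines matching each category; one line may count in several."""
    counts = Counter({name: 0 for name in LOG_PATTERNS})
    for line in lines:
        for name, pattern in LOG_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "INFO start",
    "ERROR tool failed: connection timed out",
    "WARN retrying (attempt 2)",
    "Traceback (most recent call last):",
    "info: retries exhausted",
]
print(dict(scan_log_lines(sample)))
```

Dividing these counts by the total number of log lines gives a rough error-rate proxy for the Quality/Reliability category.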
Reference templates:
- references/report_schema.md: output templates and JSON schema.
Evidence Discipline
- Tie every score to evidence by path.
- Be candid about waste, duplication, or thrash.
- End with a "Next run improvements" section of instrumentation recommendations.