autooptimise
Autonomous benchmark-driven skill optimisation for OpenClaw. Inspired by Andrej Karpathy's autoresearch — the same modify → test → score → keep/discard loop, applied to agent skill quality instead of GPU training.
Trigger Phrases
- `optimise my weather skill`
- `run autooptimise on [skill name]`
- `benchmark my [skill name] skill`
- `improve my skills overnight`
Key Files
| File | Purpose |
|---|---|
| `benchmark/tasks.json` | Test task suite (prompts + expected qualities) |
| `benchmark/scorer.md` | LLM judge scoring rubric |
| `runner/run_experiment.md` | Autonomous loop instructions (load this next) |
| `runner/experiment_log.md` | Auto-created run log (gitignored) |
How to Run
1. Read `runner/run_experiment.md`; it contains the full loop instructions.
2. Confirm the target skill with the user if it was not specified.
3. Execute the loop (max 3 iterations).
4. Present proposed changes for human approval; never auto-apply.
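
The steps above can be sketched as a small keep/discard loop. This is only an illustration, not the contents of `runner/run_experiment.md`: the names `optimise`, `propose`, and `score` are hypothetical, and the real loop delegates modification and scoring to the agent and the LLM judge.

```python
from typing import Callable, List, Tuple

MAX_ITERATIONS = 3  # v0.1 cap stated in this document


def optimise(
    baseline: str,
    propose: Callable[[str], str],   # hypothetical: returns a modified skill text
    score: Callable[[str], float],   # hypothetical: runs the benchmark, returns a score
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[str, float, List[tuple]]:
    """Keep a candidate only when it beats the best score seen so far."""
    best_text, best_score = baseline, score(baseline)
    history = [("baseline", best_score, "keep")]
    for i in range(1, max_iterations + 1):
        candidate = propose(best_text)      # modify
        candidate_score = score(candidate)  # test + score
        if candidate_score > best_score:    # keep
            best_text, best_score = candidate, candidate_score
            history.append((f"iteration {i}", candidate_score, "keep"))
        else:                               # discard
            history.append((f"iteration {i}", candidate_score, "discard"))
    # The result is a proposal only; applying it is left to a human.
    return best_text, best_score, history
```

The function returns the best candidate instead of writing it anywhere, which keeps the approval step outside the loop.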
Scoring
Use the best available LLM judge model (prefer a strong reasoning model). Score each task 0–10 on:
- Accuracy: correct answer / correct tool called
- Conciseness: no padding, no unnecessary text
- Tool usage: right tool, right parameters
- Formatting: output matches expected format
Full rubric: `benchmark/scorer.md`
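
One way the judge's output could be combined is sketched below. It assumes each of the four dimensions is scored 0–10 and that a task's score is their unweighted mean; the actual weighting lives in the rubric file and may differ. The names `DIMENSIONS`, `task_score`, and `suite_score` are hypothetical.

```python
from statistics import mean

# Assumed dimension keys; the rubric file may name them differently.
DIMENSIONS = ("accuracy", "conciseness", "tool_usage", "formatting")


def task_score(scores: dict) -> float:
    """Average the four dimension scores after validating them."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= 10:
            raise ValueError(f"{d} must be between 0 and 10, got {scores[d]}")
    return float(mean(scores[d] for d in DIMENSIONS))


def suite_score(per_task: list) -> float:
    """Average the task scores across the whole benchmark suite."""
    return float(mean(task_score(s) for s in per_task))
```

For example, dimension scores of 9, 7, 10, and 8 give a task score of 8.5 under this assumption.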
Safety Rules
- Never auto-apply changes. Always present a diff and wait for explicit human approval.
- Never modify `benchmark/tasks.json` or `benchmark/scorer.md` during a run.
- Never exceed 3 iterations per run in v0.1.
- Log every action to `runner/experiment_log.md`.
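
Two of these rules lend themselves to a mechanical check. The sketch below is an assumption about how a runner might enforce them, not part of the skill itself: it fingerprints the benchmark files before a run so any modification can be detected afterwards, and appends timestamped lines to the run log. The helper names and the log line format are hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

PROTECTED = ("benchmark/tasks.json", "benchmark/scorer.md")
LOG_PATH = "runner/experiment_log.md"


def fingerprint(root: Path) -> dict:
    """Return a SHA-256 digest for each protected file under root."""
    return {
        name: hashlib.sha256((root / name).read_bytes()).hexdigest()
        for name in PROTECTED
    }


def log_action(root: Path, message: str) -> None:
    """Append one timestamped line to the run log, creating it if needed."""
    log_file = root / LOG_PATH
    log_file.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(f"- {stamp} {message}\n")  # assumed log line format


def assert_benchmark_untouched(root: Path, before: dict) -> None:
    """Raise if any protected file changed since `before` was taken."""
    changed = [n for n, digest in fingerprint(root).items() if before[n] != digest]
    if changed:
        raise RuntimeError(f"benchmark files modified during run: {changed}")
```

A runner would call `fingerprint` once at the start, `log_action` after every step, and `assert_benchmark_untouched` before presenting results.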