Meta-Harness Evolver
What This Skill Does
Implements the Meta-Harness paper's outer-loop optimization for Hoss — your OpenClaw agent. Each night at 3 AM CDT, this skill:
1. Reads Hoss's current workspace configs + all prior evolution logs
2. Proposes a targeted harness modification via a coding-agent sub-agent
3. Evaluates the proposed harness against a benchmark of ~20 diverse task scenarios
4. Logs the candidate harness + scores + execution traces to the evolution filesystem
5. Posts a summary report to the #research Discord channel
The Meta-Harness Loop
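```
Proposer ──(filesystem access)──► Hoss workspace
   ▲                                   │
   │                            proposes harness
   │                                   ▼
   │                         evaluate on benchmark
   │                                   ▼
 Logs ───┴── stores: code + scores + traces ──► ~/hoss-evolution/
```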
Quick Start
Cron Schedule
- 3 AM CDT daily, configured via `openclaw cron`
- Cron command: `SKILL=meta-harness-evolution TASK=run_evolution openclaw run`
Manual Trigger
```
/openclaw run --skill meta-harness-evolver --task run_evolution
```
Directory Structure
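```
~/hoss-evolution/
├── best/                    # Best harness found so far
│   └── current/
├── candidates/              # All evaluated harnesses
│   └── candidate_N/         # One directory per candidate
│       ├── harness/         # Proposed config files (SOUL.md, etc.)
│       ├── eval_scores.json
│       └── traces/          # Execution traces
├── benchmark/               # Evaluation tasks + scorers
│   └── scenarios/           # ~20 diverse task scenarios
├── proposer/                # Proposer's workspace
│   └── logs/                # Proposer's own reasoning traces
└── evolution_log.jsonl      # Full run history
```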
What Can Be Evolved
Hoss's "harness" = the configs that wrap the LLM brain:
| File | What It Controls |
|---|---|
| `SOUL.md` | Core identity, personality, decision-making style |
| `IDENTITY.md` | Role, voice, tone, signature patterns |
| `AGENTS.md` | Sub-agent architecture, coordination protocol |
| `TOOLS.md` | Tool configurations, credentials, key hosts |
| `MEMORY.md` | Long-term memory structure, what to persist |
| `HEARTBEAT.md` | Active hours, check priorities, alert thresholds |
Constraints (do NOT modify):
- Credentials, API keys, or secrets in TOOLS.md
- Git safety rules (NEVER mutate git config from ~/flume/)
- Security-sensitive groupPolicy settings
The Evolution Algorithm
1. Seed: Start with Hoss's current configs as iteration 0
2. Propose: Sub-agent reads full history from ~/hoss-evolution/candidates/, identifies failure patterns, proposes 1-2 targeted edits
3. Validate: Lightweight import/syntax check before running full benchmark
4. Evaluate: Run proposed harness against all 20 benchmark scenarios, score each
5. Log: Store candidate harness + scores + proposer reasoning traces
6. Select: Maintain a Pareto frontier over (performance, simplicity); the proposer decides which candidates to keep exploring from
7. Repeat: Next night's proposer can read ALL prior candidates to build on good ideas
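One nightly iteration of the steps above might be sketched as follows. This is only an illustration: the `Candidate` class, `run_iteration`, and the `propose`/`validate`/`evaluate` callables are hypothetical names, not part of the skill's scripts.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    harness: dict[str, str]        # file name -> file contents (hypothetical shape)
    score: Optional[float] = None  # filled in after evaluation
    reasoning: str = ""            # proposer's reasoning trace


def run_iteration(
    history: list[Candidate],
    propose: Callable[[list[Candidate]], Candidate],
    validate: Callable[[Candidate], bool],
    evaluate: Callable[[Candidate], float],
) -> Optional[Candidate]:
    """Run one propose -> validate -> evaluate -> log step."""
    candidate = propose(history)    # the proposer sees the full history
    if not validate(candidate):     # cheap check before the full benchmark
        return None                 # invalid candidate: skip this iteration
    candidate.score = evaluate(candidate)
    history.append(candidate)       # logged so later proposers can build on it
    return candidate
```

Passing the steps in as callables keeps the loop independent of how the sub-agent and benchmark are actually invoked.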
Key Insight from the Paper
The skill text is the strongest lever: it steers the proposer. Iterating on the proposer's prompt/role description had more effect than changing iteration count or population size.
The Benchmark
The benchmark lives at ~/hoss-evolution/benchmark/. See references/benchmark-design.md for how to design scenarios and references/harness-spec.md for the full harness spec.
Default benchmark has 20 scenarios across categories:
- Memory: Recall, update, synthesize from memory files
- Code: Write, review, debug code tasks
- Coordination: Spawn sub-agents, synthesize results
- Research: Web search, fetch, summarize, synthesize
- Communication: Draft emails, Discord messages, iMessages
- Quality: Spot errors, inconsistencies, broken links
Each scenario has:
- A concrete task description
- Expected outcome criteria
- A scoring rubric (0-3 per scenario: fail / partial / pass / excellent)
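A scenario with these three parts could be modeled roughly as below. The field names and the `normalized` helper are assumptions for illustration; the actual scenario format is described in references/benchmark-design.md.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rubric(IntEnum):
    """The 0-3 scoring rubric applied to each scenario."""
    FAIL = 0
    PARTIAL = 1
    PASS = 2
    EXCELLENT = 3


@dataclass(frozen=True)
class Scenario:
    name: str                  # hypothetical identifier
    category: str              # e.g. "memory", "code", "research"
    task: str                  # concrete task description
    criteria: tuple[str, ...]  # expected outcome criteria


def normalized(score: Rubric) -> float:
    """Map a rubric score onto the range [0, 1]."""
    return int(score) / int(Rubric.EXCELLENT)
```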
The Proposer Agent
The proposer is a coding-agent sub-agent (default: coder) that:
- Reads all prior candidates from ~/hoss-evolution/candidates/ via filesystem ops
- Identifies patterns in failed/succeeded candidates
- Proposes targeted, specific edits (NOT wholesale rewrites)
- Writes proposed configs to the new candidate directory
- Logs its reasoning trace so future iterations can build on it
Proposer Skill (passed to sub-agent)
The proposer's role is defined by the task prompt in scripts/propose_harness.py. Key constraints:
- Can only propose edits to files in the harness spec (SOUL.md, IDENTITY.md, AGENTS.md, TOOLS.md, MEMORY.md, HEARTBEAT.md)
- Must pass lightweight validation before full evaluation
- Should prefer targeted edits over full rewrites
- Must log reasoning trace to proposer/logs/
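The first constraint lends itself to a mechanical check. The sketch below assumes a candidate's proposed files sit in a single directory; `unexpected_files` is a hypothetical helper, not something scripts/propose_harness.py is known to contain.

```python
from pathlib import Path

# The six files named in the harness spec.
ALLOWED_FILES = frozenset(
    {"SOUL.md", "IDENTITY.md", "AGENTS.md", "TOOLS.md", "MEMORY.md", "HEARTBEAT.md"}
)


def unexpected_files(harness_dir: Path) -> list[str]:
    """Return relative paths of files in harness_dir that are outside the spec."""
    found = []
    for path in harness_dir.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(harness_dir).as_posix()
        if relative not in ALLOWED_FILES:
            found.append(relative)
    return sorted(found)
```

An empty result means the candidate only touched files the proposer is allowed to edit. It says nothing about whether protected content inside TOOLS.md was changed, which would need a separate check.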
Workflow Steps
Step 1: Read Prior Candidates
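```bash
# List all prior candidates
ls ~/hoss-evolution/candidates/
# Read the best candidate
cat ~/hoss-evolution/best/current/eval_scores.json
# Read the history log
tail -20 ~/hoss-evolution/evolution_log.jsonl
```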
Step 2: Run Proposer
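```bash
# The proposer sub-agent reads ~/hoss-evolution/ and proposes a harness
# Triggered via openclaw run with this skill loaded
```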
Step 3: Validate Before Benchmark
```bash
# Quick syntax check
bash ~/hoss-evolution/scripts/validate.sh
```
Step 4: Run Benchmark
```bash
# Evaluate the candidate across all 20 scenarios
python3 ~/hoss-evolution/scripts/evaluate.py
```
Step 5: Log Results
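```bash
# Scores + traces are written to the candidate directory automatically
# The evolution log is updated
```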
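The update to evolution_log.jsonl could look something like this. It is a minimal sketch: the entry fields (`timestamp`, `candidate`, `score`) are assumed for illustration and may not match what the real scripts write.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_log_entry(log_path: Path, candidate: str, score: float) -> dict:
    """Append one JSON object per line to the evolution log and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed field
        "candidate": candidate,                               # assumed field
        "score": score,                                       # assumed field
    }
    # Append mode keeps every prior iteration, so later proposers see the full history.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry
```

Because each line is an independent JSON object, `tail -20` on the log always yields complete records.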
Step 6: Post to Discord
```bash
# Post summary to #research
python3 ~/hoss-evolution/scripts/posttoresearch.py
```
Scoring
Final score = weighted average across scenario categories:
- Memory tasks: 25%
- Code tasks: 25%
- Coordination: 15%
- Research: 20%
- Communication: 10%
- Quality: 5%
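The weighting above might be computed as follows. One assumption is made loudly here: each category's score is taken as the mean of its scenarios' 0-3 rubric scores, which the skill does not spell out. `final_score` and the category keys are hypothetical names.

```python
# Category weights from the list above; they sum to 1.0.
WEIGHTS = {
    "memory": 0.25,
    "code": 0.25,
    "coordination": 0.15,
    "research": 0.20,
    "communication": 0.10,
    "quality": 0.05,
}


def final_score(scores_by_category: dict[str, list[int]]) -> float:
    """Weighted average of per-category mean rubric scores, on the 0-3 scale.

    Assumes every category has at least one scored scenario.
    """
    total = 0.0
    for category, weight in WEIGHTS.items():
        scores = scores_by_category[category]
        total += weight * (sum(scores) / len(scores))  # mean within category (assumption)
    return total
```

With this aggregation, a harness that scores 3 on every memory scenario and 0 everywhere else ends up at 0.75 out of 3.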
Results are tracked as a Pareto frontier: for each candidate, log both score and "complexity" (size/diff of changes). Simpler harnesses that score equally get priority.
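A Pareto frontier over score and complexity can be sketched like this. The `Result` type and the choice of an integer complexity (for example, lines changed) are assumptions for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    score: float     # higher is better
    complexity: int  # lower is better, e.g. lines changed (assumption)


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = a.score >= b.score and a.complexity <= b.complexity
    strictly_better = a.score > b.score or a.complexity < b.complexity
    return at_least_as_good and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep only candidates that no other candidate dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]
```

Two candidates with equal scores illustrate the tie-break: the one with the smaller diff dominates the other, so only the simpler harness stays on the frontier.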
Resources
Notes
- The proposer sub-agent runs with runtime=subagent, not ACP, because it needs filesystem access to ~/hoss-evolution/
- Cron is configured outside this skill via `openclaw cron`
- If the proposer fails to produce a valid candidate, the iteration is skipped (no penalty)
- Benchmark scenarios should be diverse enough that no single strategy can game all of them
- The evolution workspace is NOT inside ~/.openclaw/ — it's at ~/hoss-evolution/ to keep it separate from operational configs