Meta-Harness Evolver

What This Skill Does

Implements the Meta-Harness paper's outer-loop optimization for Hoss — your OpenClaw agent. Each night at 3 AM CDT, this skill:

1. Reads Hoss's current workspace configs and all prior evolution logs
2. Proposes a targeted harness modification via a coding-agent sub-agent
3. Evaluates the proposed harness against a benchmark of ~20 diverse task scenarios
4. Logs the candidate harness, scores, and execution traces to the evolution filesystem
5. Posts a summary report to the #research Discord channel

The Meta-Harness Loop

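    Proposer agent ──(filesystem access)──► Hoss workspace
            ▲                                     │
            │                              proposes harness
            │                                     ▼
            │                            evaluate on benchmark
            │                                     ▼
    Logs ───┴── stores: code + scores + traces ──► ~/hoss-evolution/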

Quick Start

Cron Schedule

- 3 AM CDT daily, configured via `openclaw cron`
- Cron command: `SKILL=meta-harness-evolution TASK=run_evolution openclaw run`

Manual Trigger

    /openclaw run --skill meta-harness-evolver --task run_evolution

Directory Structure

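    ~/hoss-evolution/
    ├── best/                  # Best harness found so far
    │   └── current/
    ├── candidates/            # All evaluated harnesses
    │   └── candidate_N/       # One directory per candidate
    │       ├── harness/       # Proposed config files (SOUL.md, etc.)
    │       ├── eval_scores.json
    │       └── traces/        # Execution traces
    ├── benchmark/             # Evaluation tasks + scorers
    │   └── scenarios/         # ~20 diverse task scenarios
    ├── proposer/              # Proposer agent's workspace
    │   └── logs/              # Proposer's own reasoning traces
    └── evolution_log.jsonl    # Full run history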

What Can Be Evolved

Hoss's "harness" = the configs that wrap the LLM brain:

| File | What It Controls |
|------|------------------|
| `SOUL.md` | Core identity, personality, decision-making style |
| `IDENTITY.md` | |

Constraints (do NOT modify):

- Credentials, API keys, or secrets in TOOLS.md
- Git safety rules (NEVER mutate git config from ~/flume/)
- Security-sensitive groupPolicy settings

The Evolution Algorithm

1. Seed: Start with Hoss's current configs as iteration 0
2. Propose: The sub-agent reads the full history from ~/hoss-evolution/candidates/, identifies failure patterns, and proposes 1-2 targeted edits
3. Validate: Run a lightweight import/syntax check before the full benchmark
4. Evaluate: Run the proposed harness against all 20 benchmark scenarios and score each one
5. Log: Store the candidate harness, scores, and proposer reasoning traces
6. Select: Maintain a Pareto frontier over (performance, simplicity); the proposer decides which candidates to keep exploring from
7. Repeat: The next night's proposer can read ALL prior candidates to build on good ideas
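
The steps above can be sketched as a single nightly iteration. This is a minimal sketch under stated assumptions, not the skill's actual code: `run_iteration`, the `propose` / `validate` / `evaluate` callables, and the record fields are hypothetical names introduced for illustration. Selection is left out because the text assigns it to the proposer.

```python
from typing import Callable, Optional


def run_iteration(
    history: list[dict],
    propose: Callable[[list[dict]], Optional[dict]],
    validate: Callable[[dict], bool],
    evaluate: Callable[[dict], dict[str, int]],
) -> Optional[dict]:
    """Run one propose -> validate -> evaluate -> log step.

    Returns the new history record, or None when the iteration is skipped.
    All callables are hypothetical stand-ins for the sub-agent and scripts.
    """
    candidate = propose(history)  # the proposer sees the full history
    if candidate is None or not validate(candidate):
        return None  # no valid candidate: skip the iteration, no penalty
    scores = evaluate(candidate)  # per-scenario scores, 0-3 each
    record = {
        "iteration": len(history),
        "harness": candidate,
        "scores": scores,
    }
    history.append(record)  # logged so later proposers can build on it
    return record


if __name__ == "__main__":
    # Iteration 0 is the seed: the current configs, evaluated as-is.
    history: list[dict] = []
    seed = {"SOUL.md": "current contents"}
    run_iteration(history, lambda h: seed, lambda c: True, lambda c: {"memory-01": 2})
    # A proposer that returns nothing leaves the history unchanged.
    skipped = run_iteration(history, lambda h: None, lambda c: True, lambda c: {})
    print(len(history), skipped)  # 1 None
```

Passing the steps in as callables keeps the loop independent of how the sub-agent and benchmark are actually invoked.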

Key Insight from the Paper

The skill text is the strongest lever — it steers the proposer. Iterating on the proposer's prompt/role description had more effect than changing iteration count or population size.

The Benchmark

The benchmark lives at ~/hoss-evolution/benchmark/. See references/benchmark-design.md for how to design scenarios and references/harness-spec.md for the full harness spec.

The default benchmark has 20 scenarios across six categories:

- Memory: Recall, update, synthesize from memory files
- Code: Write, review, debug code tasks
- Coordination: Spawn sub-agents, synthesize results
- Research: Web search, fetch, summarize, synthesize
- Communication: Draft emails, Discord messages, iMessages
- Quality: Spot errors, inconsistencies, broken links

Each scenario has:

- A concrete task description
- Expected outcome criteria
- A scoring rubric (0-3 per scenario: fail / partial / pass / excellent)
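
One way to picture a scenario record is as a small data class plus the 0-3 rubric. The on-disk scenario format is not specified here, so the `Scenario` class, its field names, and the `label` helper are illustrative assumptions; only the four rubric labels come from the text above.

```python
from dataclasses import dataclass, field

# Rubric labels as listed above: 0 = fail ... 3 = excellent.
RUBRIC = {0: "fail", 1: "partial", 2: "pass", 3: "excellent"}


@dataclass
class Scenario:
    """One benchmark scenario; field names are illustrative assumptions."""
    scenario_id: str
    category: str  # e.g. "memory", "code", "research"
    task: str      # concrete task description
    criteria: list[str] = field(default_factory=list)  # expected outcome criteria


def label(score: int) -> str:
    """Map a 0-3 rubric score to its label, rejecting out-of-range values."""
    if score not in RUBRIC:
        raise ValueError(f"score must be 0-3, got {score}")
    return RUBRIC[score]
```

Rejecting out-of-range scores early keeps a faulty scorer from silently skewing the averages.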

The Proposer Agent

The proposer is a coding-agent sub-agent (default: coder) that:

- Reads all prior candidates from ~/hoss-evolution/candidates/ via filesystem ops
- Identifies patterns in failed/succeeded candidates
- Proposes targeted, specific edits (NOT wholesale rewrites)
- Writes proposed configs to the new candidate directory
- Logs its reasoning trace so future iterations can build on it

Proposer Skill (passed to sub-agent)

The proposer's role is defined by the task prompt in scripts/propose_harness.py. Key constraints:

- Can only propose edits to files in the harness spec (SOUL.md, IDENTITY.md, AGENTS.md, TOOLS.md, MEMORY.md, HEARTBEAT.md)
- Must pass lightweight validation before full evaluation
- Should prefer targeted edits over full rewrites
- Must log reasoning trace to proposer/logs/
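
The first constraint can be checked mechanically before any evaluation runs. The sketch below is an assumption about how such a check might look, not the contents of scripts/propose_harness.py: `disallowed_edits` is a hypothetical helper, and it assumes harness files sit at the top level of the candidate's harness directory.

```python
from pathlib import PurePosixPath

# File names taken from the harness spec list above.
ALLOWED_FILES = {
    "SOUL.md", "IDENTITY.md", "AGENTS.md",
    "TOOLS.md", "MEMORY.md", "HEARTBEAT.md",
}


def disallowed_edits(edited_paths: list[str]) -> list[str]:
    """Return the edited paths that fall outside the harness spec."""
    bad = []
    for raw in edited_paths:
        path = PurePosixPath(raw)
        # Assumption: only top-level harness files are accepted,
        # so nested or parent-relative paths are rejected outright.
        if len(path.parts) != 1 or path.name not in ALLOWED_FILES:
            bad.append(raw)
    return bad
```

A check like this covers file names only. It does not inspect file contents, so it cannot tell whether credentials inside TOOLS.md were touched.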

Workflow Steps

Step 1: Read Prior Candidates

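    # List all prior candidates
    ls ~/hoss-evolution/candidates/

    # Read the best candidate's scores
    cat ~/hoss-evolution/best/current/eval_scores.json

    # Read the recent history log
    tail -20 ~/hoss-evolution/evolution_log.jsonl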
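
Prior candidates can also be read programmatically, which is closer to what a proposer needs when comparing many runs. This is a rough sketch: `load_candidate_scores` is a hypothetical helper, and since the schema of eval_scores.json is not documented here, the file is simply parsed as JSON.

```python
import json
from pathlib import Path


def load_candidate_scores(root: Path) -> dict[str, dict]:
    """Map each candidate directory name to its parsed eval_scores.json."""
    results: dict[str, dict] = {}
    candidates_dir = root / "candidates"
    if not candidates_dir.is_dir():
        return results  # nothing evaluated yet
    for cand in sorted(candidates_dir.iterdir()):
        scores_file = cand / "eval_scores.json"
        if scores_file.is_file():  # skip candidates that were never scored
            results[cand.name] = json.loads(scores_file.read_text(encoding="utf-8"))
    return results
```

Calling `load_candidate_scores(Path.home() / "hoss-evolution")` would return one entry per scored candidate directory.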

Step 2: Run Proposer

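    # The proposer sub-agent reads ~/hoss-evolution/ and proposes a change
    # Triggered via openclaw run with this skill loaded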

Step 3: Validate Before Benchmark

    # Quick syntax check
    bash ~/hoss-evolution/scripts/validate.sh

Step 4: Run Benchmark

    # Evaluate the candidate on all 20 scenarios
    python3 ~/hoss-evolution/scripts/evaluate.py

Step 5: Log Results

    # Scores + traces are written to the candidate directory automatically
    # The evolution log is updated
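
Because evolution_log.jsonl is a JSON Lines file, updating it amounts to appending one JSON object per line. The helpers below are a sketch of that convention only: `append_log` and `read_log` are hypothetical names, and the record fields the skill actually writes are not specified here.

```python
import json
from pathlib import Path


def append_log(log_path: Path, record: dict) -> None:
    """Append one record as a single JSON line (the JSONL convention)."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_log(log_path: Path) -> list[dict]:
    """Read every record back, ignoring blank lines."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending rather than rewriting keeps earlier runs intact, which matters because later proposers read the whole history.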

Step 6: Post to Discord

    # Post summary to #research
    python3 ~/hoss-evolution/scripts/post_to_research.py

Scoring

Final score = weighted average across scenario categories:

- Memory tasks: 25%
- Code tasks: 25%
- Coordination: 15%
- Research: 20%
- Communication: 10%
- Quality: 5%
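
As a worked illustration of the weighting, the sketch below averages the 0-3 scores within each category and then applies the weights listed above. Taking the mean within a category is an assumption, and `final_score` and the lowercase category keys are hypothetical names; scripts/evaluate.py may aggregate differently.

```python
# Category weights from the list above; they sum to 1.0.
WEIGHTS = {
    "memory": 0.25, "code": 0.25, "coordination": 0.15,
    "research": 0.20, "communication": 0.10, "quality": 0.05,
}


def final_score(scores_by_category: dict[str, list[int]]) -> float:
    """Weighted average of per-category mean scores, on the 0-3 rubric scale."""
    total = 0.0
    for category, weight in WEIGHTS.items():
        scores = scores_by_category.get(category, [])
        if not scores:
            raise ValueError(f"no scores for category {category!r}")
        total += weight * (sum(scores) / len(scores))
    return total
```

Under this scheme, the per-category mean keeps a category with many scenarios from outweighing its assigned share.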

Results are tracked as a Pareto frontier: for each candidate, log both score and "complexity" (size/diff of changes). Simpler harnesses that score equally get priority.
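
A Pareto frontier over (score, complexity) keeps every candidate that no other candidate beats on both axes at once. The following is a generic sketch of that idea, not the logic in references/evolution-logic.md: `Candidate`, `dominates`, and `pareto_frontier` are hypothetical names, and complexity is assumed to be a single integer such as a diff size.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    score: float     # higher is better
    complexity: int  # assumed: size of the diff; lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.score >= b.score and a.complexity <= b.complexity
    strictly_better = a.score > b.score or a.complexity < b.complexity
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

With equal scores, the lower-complexity candidate dominates the other, which matches the rule that simpler harnesses with the same score get priority.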

Resources

- references/harness-spec.md: Full spec of what constitutes Hoss's harness and what can and cannot be modified
- references/benchmark-design.md: How to design benchmark scenarios and scoring rubrics, and how to add new scenarios
- references/evolution-logic.md: Detailed evolution algorithm, parent selection, Pareto frontier logic
- scripts/run_evolution.py: Main entry point, runs the full loop
- scripts/propose_harness.py: The proposer sub-agent task definition
- scripts/evaluate.py: Benchmark runner
- scripts/post_to_research.py: Discord reporter

Notes

- The proposer sub-agent runs with runtime=subagent, not ACP, because it needs filesystem access to ~/hoss-evolution/
- Cron is configured outside this skill via `openclaw cron`
- If the proposer fails to produce a valid candidate, the iteration is skipped (no penalty)
- Benchmark scenarios should be diverse enough that no single strategy can game all of them
- The evolution workspace is NOT inside ~/.openclaw/; it lives at ~/hoss-evolution/ to keep it separate from operational configs
