agent-autoresearch

Any agent can run this. The experiment is always: change something → measure it → keep what works.

The Core Idea

Karpathy's insight: give an agent a fixed time budget, let it modify one file, measure if things got better, keep or discard, repeat.

Applied to agents: your workspace is train.py. Your SOUL.md, scripts, and skills are the experiment substrate.

Propose → Implement → Measure → Keep/Kill → Integrate → Repeat

You are not just optimizing content. You are optimizing the agent itself.

What Can Be Mutated

The agent can propose changes to any file it owns:

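| Category | Examples |
| --- | --- |
| Behavior | New response patterns, different tone, new check routines |
| Workflow | New scripts, automations, scheduled jobs, notification flows |
| Memory | Updated MEMORY.md entries, new daily routines |
| Identity | Revised SOUL.md directives, new operating rules |
| Skills | New skill installs, skill configuration |
| Quality | New validation logic, error-handling patterns |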

The agent cannot mutate: safety rules, constitution, security boundaries, or files it doesn't own.

Project Structure

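    agent-autoresearch/
    ├── SKILL.md          ← you are here
    ├── program.md        ← 🧠 instructions for the experiment agent
    ├── prepare.py        ← establishes baseline metrics
    ├── evolve.py         ← integrates KEEP verdicts into agent files
    ├── analyze.py        ← computes the verdict from measurements
    ├── baseline.json     ← current agent baseline (performance + strategy)
    ├── results.tsv       ← all experiment results (append-only log)
    └── experiments/
        ├── meta.json     ← experiment state (next experiment ID, kill streak)
        ├── active.md     ← only one active experiment at a time
        └── archive/      ← completed experiments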

🚀 Quick Start

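    # 1. Establish a baseline (measure current agent performance)
    python3 prepare.py --metric taskcompletionrate --baseline 0.75

    # 2. Read the experiment brief
    cat program.md

    # 3. Start the experiment loop
    # The agent reads program.md, proposes a self-improvement, implements the change,
    # measures the result, and applies the keep/kill verdict.

    # Check current state
    python3 prepare.py --status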

Baseline Metrics

Track what matters for the agent's mission. Examples:

| Mission | Metric | How to Measure |
| --- | --- | --- |
| Task completion | `taskcompletionrate` | % tasks completed vs assigned |
| Response quality | | |

Establish baseline with ≥ 10 measurements before running experiments.

Verdict Logic

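    improvement = (experiment_score - baseline_score) / baseline_score

    ≥ +10%        → KEEP   (integrate the change into the agent)
    ≤ -10%        → KILL   (discard, revert to the previous state)
    -10% to +10%  → MODIFY (extend the evaluation, or treat as KILL)

For quality/rating metrics (higher is better): the thresholds above apply as written.
For cost/latency metrics (lower is better): flip the sign of the improvement before applying the thresholds.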

Key Rules

- ❌ More than one mutation at a time: test one change per experiment
- ❌ No baseline: you need ≥ 10 measurements before experimenting
- ❌ Vibes verdicts: use actual measurements
- ❌ Mutating safety/constitution files: never
- ❌ Continuing after a kill streak ≥ 3: pause and wait for human review
- ❌ Infinite MODIFY: max one extension
- ❌ Reverting a KEEP: only a newer KEEP overrides it

Commands

| Command | What |
| --- | --- |
| `python3 prepare.py --status` | Check current state |
| `python3 prepare.py --metric X --baseline Y` | |

Security

- Agents can only mutate files within their own workspace
- Safety rules and constitution are always excluded from mutation
- External API calls require human approval
- Destructive operations (rm, git reset --hard) require explicit confirmation

agent-autoresearch

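Any agent can run this. The experiment is always: change something → measure it → keep what works.

The Core Idea

Karpathy's insight: give an agent a fixed time budget, let it modify one file, measure if things got better, keep or discard, repeat.

Applied to agents: your workspace is train.py. Your SOUL.md, scripts, and skills are the experiment substrate.

Propose → Implement → Measure → Keep/Kill → Integrate → Repeat

You are not just optimizing content. You are optimizing the agent itself.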

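What Can Be Mutated

The agent can propose changes to any file it owns:

| Category | Examples |
| --- | --- |
| Behavior | New response patterns, different tone, new check routines |
| Workflow | New scripts, automations, scheduled jobs, notification flows |
| Memory | Updated MEMORY.md entries, new daily routines |
| Identity | Revised SOUL.md directives, new operating rules |
| Skills | New skill installs, skill configuration |
| Quality | New validation logic, error-handling patterns |

The agent cannot mutate: safety rules, constitution, security boundaries, or files it doesn't own.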

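Project Structure

    agent-autoresearch/
    ├── SKILL.md          ← you are here
    ├── program.md        ← 🧠 instructions for the experiment agent
    ├── prepare.py        ← establishes baseline metrics
    ├── evolve.py         ← integrates KEEP verdicts into agent files
    ├── analyze.py        ← computes the verdict from measurements
    ├── baseline.json     ← current agent baseline (performance + strategy)
    ├── results.tsv       ← all experiment results (append-only log)
    └── experiments/
        ├── meta.json     ← experiment state (next experiment ID, kill streak)
        ├── active.md     ← only one active experiment at a time
        └── archive/      ← completed experiments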
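
The structure above describes results.tsv as an append-only log. A minimal sketch of what appending one result could look like follows. The column names in `FIELDS` and the helper `append_result` are hypothetical, chosen only for illustration; the actual layout used by the project's scripts is not specified here.

```python
import csv
from pathlib import Path

# Hypothetical column layout; the real results.tsv schema is not documented here.
FIELDS = ["experiment_id", "metric", "baseline", "score", "verdict"]


def append_result(path, row):
    """Append one experiment result to a TSV log without rewriting earlier rows."""
    path = Path(path)
    # Write the header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    # Mode "a" only ever adds to the end of the file, which keeps the log append-only.
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Opening the file in append mode means a crashed or killed experiment can never truncate earlier results, which is the property an append-only log is meant to provide.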

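🚀 Quick Start

    # 1. Establish a baseline (measure current agent performance)
    python3 prepare.py --metric taskcompletionrate --baseline 0.75

    # 2. Read the experiment brief
    cat program.md

    # 3. Start the experiment loop
    # The agent reads program.md, proposes a self-improvement, implements the change,
    # measures the result, and applies the keep/kill verdict.

    # Check current state
    python3 prepare.py --status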

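Baseline Metrics

Track what matters for the agent's mission. Examples:

| Mission | Metric | How to Measure |
| --- | --- | --- |
| Task completion | `taskcompletionrate` | % tasks completed vs assigned |
| Response quality | | |

Establish a baseline with ≥ 10 measurements before running experiments.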
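
As a sketch of the ≥ 10 measurements rule, the helper below refuses to produce a baseline from too few data points. The function name `establish_baseline` and the returned fields are assumptions for illustration, not the behavior of prepare.py.

```python
from statistics import mean, stdev

MIN_BASELINE_MEASUREMENTS = 10  # the minimum stated in this document


def establish_baseline(measurements):
    """Summarize baseline measurements, or raise if there are too few of them."""
    if len(measurements) < MIN_BASELINE_MEASUREMENTS:
        raise ValueError(
            f"need at least {MIN_BASELINE_MEASUREMENTS} measurements, "
            f"got {len(measurements)}"
        )
    return {
        "n": len(measurements),
        "mean": mean(measurements),
        # Spread is not required by the document; it is included as a hint of
        # how noisy the metric is before a +/-10% change is trusted.
        "stdev": stdev(measurements),
    }
```

Reporting the spread alongside the mean is a design choice of this sketch: if the baseline itself varies by more than 10% between measurements, a single experiment score crossing a threshold says little.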

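Verdict Logic

    improvement = (experiment_score - baseline_score) / baseline_score

    ≥ +10%        → KEEP   (integrate the change into the agent)
    ≤ -10%        → KILL   (discard, revert to the previous state)
    -10% to +10%  → MODIFY (extend the evaluation, or treat as KILL)

For quality/rating metrics (higher is better): the thresholds above apply as written.
For cost/latency metrics (lower is better): flip the sign of the improvement before applying the thresholds.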
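
The verdict rule above can be sketched as a small function. This is an illustration under stated assumptions, not the contents of analyze.py: the name `verdict`, the `higher_is_better` flag, and the requirement of a positive baseline are all choices made for this example.

```python
KEEP_THRESHOLD = 0.10   # improvement >= +10% -> KEEP
KILL_THRESHOLD = -0.10  # improvement <= -10% -> KILL


def verdict(baseline, experiment, higher_is_better=True):
    """Return (verdict, improvement) for one experiment against the baseline."""
    if baseline <= 0:
        # Assumption of this sketch: a relative change needs a positive baseline.
        raise ValueError("baseline score must be positive")
    improvement = (experiment - baseline) / baseline
    if not higher_is_better:
        # Cost/latency metrics: a decrease counts as an improvement.
        improvement = -improvement
    if improvement >= KEEP_THRESHOLD:
        return "KEEP", improvement
    if improvement <= KILL_THRESHOLD:
        return "KILL", improvement
    return "MODIFY", improvement
```

For example, `verdict(0.75, 0.90)` yields a KEEP, and `verdict(2.0, 1.5, higher_is_better=False)` also yields a KEEP because a 25% drop in latency is flipped into a 25% improvement. Scores that land exactly on a threshold are sensitive to floating-point rounding, so a real implementation may want to round the improvement before comparing.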

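Key Rules

- ❌ More than one mutation at a time: test one change per experiment
- ❌ No baseline: you need ≥ 10 measurements before experimenting
- ❌ Vibes verdicts: use actual measurements
- ❌ Mutating safety/constitution files: never
- ❌ Continuing after a kill streak ≥ 3: pause and wait for human review
- ❌ Infinite MODIFY: max one extension
- ❌ Reverting a KEEP: only a newer KEEP overrides it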
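
Two of the rules above are stateful: the kill streak and the one-extension limit on MODIFY. The sketch below shows one way to track them. The class `ExperimentState`, the action names, and the choice to reset the streak on a KEEP are assumptions of this example; the document only says that the streak counts consecutive kills and that meta.json stores it.

```python
from dataclasses import dataclass

MAX_KILL_STREAK = 3  # a streak of 3 or more pauses the loop for human review
MAX_EXTENSIONS = 1   # a MODIFY verdict may extend an experiment only once


@dataclass
class ExperimentState:
    kill_streak: int = 0
    extensions_used: int = 0  # extensions spent on the active experiment
    paused: bool = False


def apply_verdict(state, verdict):
    """Update state for one verdict and return the next action to take."""
    if verdict not in ("KEEP", "KILL", "MODIFY"):
        raise ValueError(f"unknown verdict: {verdict!r}")
    if state.paused:
        return "PAUSED"  # nothing runs until a human clears the pause
    if verdict == "MODIFY":
        if state.extensions_used < MAX_EXTENSIONS:
            state.extensions_used += 1
            return "EXTEND"
        verdict = "KILL"  # extension already used: treat as KILL
    state.extensions_used = 0  # the experiment ends on KEEP or KILL
    if verdict == "KEEP":
        state.kill_streak = 0  # assumption: a KEEP breaks the streak
        return "INTEGRATE"
    state.kill_streak += 1
    if state.kill_streak >= MAX_KILL_STREAK:
        state.paused = True
        return "PAUSE_FOR_REVIEW"
    return "REVERT"
```

Keeping the pause as explicit state, rather than recomputing it from the streak, means the loop stays stopped until someone deliberately resets it.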

Commands

| Command | What |
| --- | --- |
| `python3 prepare.py --status` | Check current state |
| `python3 prepare.py --metric X --baseline Y` | |

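Security

- Agents can only mutate files within their own workspace
- Safety rules and constitution are always excluded from mutation
- External API calls require human approval
- Destructive operations (rm, git reset --hard) require explicit confirmation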
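
The first two rules above amount to a path check before any mutation. A minimal sketch follows, assuming Python 3.9+ for `Path.is_relative_to`. The file names in `PROTECTED_NAMES` are hypothetical placeholders, since this document does not say what the safety and constitution files are called.

```python
from pathlib import Path

# Hypothetical names; substitute whatever files hold the safety rules
# and constitution in a real workspace.
PROTECTED_NAMES = {"SAFETY.md", "CONSTITUTION.md"}


def is_mutable(target, workspace):
    """True if target resolves to a non-protected path inside workspace."""
    root = Path(workspace).resolve()
    # Resolve first so that "../" segments and absolute paths cannot escape.
    path = (root / target).resolve()
    if not path.is_relative_to(root):  # requires Python 3.9+
        return False
    return path.name not in PROTECTED_NAMES
```

Resolving the path before comparing it is the important step: a plain string prefix check would accept a target such as `../other-agent/SOUL.md`.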
