agent-autoresearch
Any agent can run this. The experiment is always: change something → measure it → keep what works.
The Core Idea
Karpathy's insight: give an agent a fixed time budget, let it modify one file, measure whether things got better, keep or discard the change, and repeat.
Applied to agents: your workspace plays the role of train.py. Your SOUL.md, scripts, and skills are the experiment substrate.

    Propose → Implement → Measure → KEEP/KILL → Integrate → Repeat

You are not just optimizing content. You are optimizing the agent itself.
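
A minimal sketch of that loop, under stated assumptions: `hill_climb`, `propose`, `score`, and `max_rounds` are hypothetical names that do not appear in this project, and the sketch keeps any strict improvement rather than applying the ±10% thresholds described under Verdict Logic.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def hill_climb(
    current: T,
    propose: Callable[[T], T],
    score: Callable[[T], float],
    budget_s: float,
    max_rounds: Optional[int] = None,
) -> Tuple[T, float]:
    """Repeat propose -> measure -> keep/discard until the time budget runs out."""
    best_score = score(current)
    deadline = time.monotonic() + budget_s  # fixed time budget
    rounds = 0
    while time.monotonic() < deadline:
        if max_rounds is not None and rounds >= max_rounds:
            break
        candidate = propose(current)          # change something
        candidate_score = score(candidate)    # measure it
        if candidate_score > best_score:      # keep what works
            current, best_score = candidate, candidate_score
        rounds += 1                           # otherwise the candidate is discarded
    return current, best_score
```

Here `current` stands in for the agent's state (for example, the contents of one file), and `score` stands in for whatever metric the experiment tracks.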
What Can Be Mutated
The agent can propose changes to any file it owns:
| Category | Examples |
|---|---|
| Behavior | New response patterns, different tone, new check routines |
| Workflow | New scripts, automations, cron jobs, notification flows |
| Memory | Updated MEMORY.md entries, new daily conventions |
| Identity | Revised SOUL.md directives, new operational rules |
| Skills | New skill installations, skill configurations |
| Quality | New validation logic, error handling patterns |
The agent cannot mutate: safety rules, constitution, security boundaries, or files it doesn't own.
Project Structure
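
    agent-autoresearch/
    ├── SKILL.md          ← you are here
    ├── program.md        ← 🧠 instructions for the experiment agent
    ├── prepare.py        ← establishes baseline metrics
    ├── evolve.py         ← integrates KEEP verdicts into agent files
    ├── analyze.py        ← computes the verdict from measurements
    ├── baseline.json     ← current agent baseline (performance + strategy)
    ├── results.tsv       ← all experiment results (append-only log)
    └── experiments/
        ├── meta.json     ← experiment state (next experiment ID, kill streak)
        ├── active.md     ← only one active experiment at a time
        └── archive/      ← completed experiments
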
🚀 Quick Start
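
    # 1. Establish a baseline (measure current agent performance)
    python3 prepare.py --metric task_completion_rate --baseline 0.75

    # 2. Read the experiment brief
    cat program.md

    # 3. Start the experiment loop
    # The agent reads program.md, proposes a self-improvement, implements the change,
    # measures the result, and executes the KEEP/KILL verdict.

    # Check current state
    python3 prepare.py --status
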
Baseline Metrics
Track what matters for the agent's mission. Examples:
| Mission | Metric | How to Measure |
|---|---|---|
| Task completion | `task_completion_rate` | % of assigned tasks completed |
| Response quality | `output_quality_score` | Human rating 1-10 or diff-based |
| Speed | `avg_response_time_s` | Seconds per response |
| Self-improvement | `learnings_logged` | Entries added to MEMORY.md per week |
| Autonomy | `escalations_to_human` | Times a human was unnecessarily interrupted |
Establish a baseline from ≥ 10 measurements before running experiments.
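
As an illustration of the ≥ 10 measurements rule, the sketch below refuses to produce a baseline from too few samples. It is not the logic of `prepare.py`, which is not shown here; `compute_baseline` and the returned keys are assumptions made for the example.

```python
import statistics
from typing import Dict, Sequence

MIN_BASELINE_SAMPLES = 10  # from the "≥ 10 measurements" rule


def compute_baseline(measurements: Sequence[float]) -> Dict[str, float]:
    """Summarize baseline measurements, rejecting samples that are too small."""
    n = len(measurements)
    if n < MIN_BASELINE_SAMPLES:
        raise ValueError(
            f"need at least {MIN_BASELINE_SAMPLES} measurements, got {n}"
        )
    return {
        "n": n,
        "mean": statistics.fmean(measurements),
        # The sample standard deviation hints at how noisy the metric is.
        "stdev": statistics.stdev(measurements),
    }
```

Recording the spread alongside the mean makes it easier to judge whether a later change is larger than ordinary run-to-run noise.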
Verdict Logic
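
    improvement = (experiment_score - baseline_score) / baseline_score

    ≥ +10%         → KEEP   (integrate the change into the agent)
    ≤ -10%         → KILL   (discard, revert to the previous state)
    -10% to +10%   → MODIFY (extend the evaluation or treat as KILL)
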
For quality/rating metrics (higher is better): the thresholds above apply as written.
For cost/latency metrics (lower is better): flip the sign of the improvement before applying the thresholds.
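
The verdict rule (KEEP at +10% or better, KILL at -10% or worse, MODIFY in between, with the sign flipped for lower-is-better metrics) can be sketched as follows. This is an illustration, not the contents of `analyze.py`; the function name and the `higher_is_better` parameter are assumptions, and scores that land exactly on a threshold are subject to floating-point rounding.

```python
def compute_verdict(
    baseline_score: float,
    experiment_score: float,
    higher_is_better: bool = True,
    threshold: float = 0.10,
) -> str:
    """Return KEEP, KILL, or MODIFY from the relative change against baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive for a relative change")
    improvement = (experiment_score - baseline_score) / baseline_score
    if not higher_is_better:
        # Cost/latency metrics: a decrease counts as an improvement.
        improvement = -improvement
    if improvement >= threshold:
        return "KEEP"
    if improvement <= -threshold:
        return "KILL"
    return "MODIFY"
```

For example, a task completion rate that moves from 0.75 to 0.90 is a 20% improvement, and a response time that drops from 2.0 s to 1.5 s is a 25% improvement once the sign is flipped.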
Key Rules
- ❌ Multiple mutations at once: test one change per experiment
- ❌ No baseline: collect ≥ 10 measurements before experimenting
- ❌ Vibes verdicts: decide from actual measurements
- ❌ Mutating safety/constitution files: never
- ❌ Ignoring a kill streak: after ≥ 3 consecutive KILLs, pause and wait for human review
- ❌ Infinite MODIFY: at most one extension per experiment
- ❌ Reverting a KEEP: only a newer KEEP overrides it
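
Two of these rules are stateful: the kill streak and the single MODIFY extension. The sketch below shows one way to track them. It rests on assumptions that the rules do not spell out: that a KEEP resets the streak, and that a second MODIFY for the same experiment is treated as a KILL. `ExperimentState` and `record_verdict` are hypothetical names.

```python
from dataclasses import dataclass

MAX_KILL_STREAK = 3  # from the "kill streak ≥ 3" rule


@dataclass
class ExperimentState:
    kill_streak: int = 0
    paused: bool = False  # True means: wait for human review


def record_verdict(
    state: ExperimentState, verdict: str, already_extended: bool = False
) -> str:
    """Apply the streak and extension rules; return the effective verdict."""
    if verdict not in ("KEEP", "KILL", "MODIFY"):
        raise ValueError(f"unknown verdict: {verdict}")
    if verdict == "MODIFY":
        if not already_extended:
            return "MODIFY"  # the one allowed extension
        verdict = "KILL"     # assumption: no second extension
    if verdict == "KEEP":
        state.kill_streak = 0  # assumption: a KEEP ends the streak
    else:
        state.kill_streak += 1
        if state.kill_streak >= MAX_KILL_STREAK:
            state.paused = True
    return verdict
```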
Commands
| Command | What it does |
|---|---|
| `python3 prepare.py --status` | Check current state |
| `python3 prepare.py --metric X --baseline Y` | Establish baseline |
| `python3 analyze.py experiments/active.md --auto` | Compute verdict |
| `python3 evolve.py experiments/active.md` | Execute KEEP verdict |
| `python3 evolve.py experiments/active.md --kill` | Execute KILL verdict |
Security
- Agents can only mutate files within their own workspace
- Safety rules and constitution are always excluded from mutation
- External API calls require human approval
- Destructive operations (rm, git reset --hard) require explicit confirmation
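
The first two points can be enforced with a path check before any file is written. The sketch below is one possible guard, not part of this project's scripts; `is_mutable` is a hypothetical helper, and the file names in `PROTECTED_NAMES` are placeholders, since the actual names of the safety and constitution files are not given here.

```python
from pathlib import Path
from typing import Union

# Placeholder names (assumption): substitute the real safety/constitution files.
PROTECTED_NAMES = {"SAFETY.md", "CONSTITUTION.md"}


def is_mutable(path: Union[str, Path], workspace: Union[str, Path]) -> bool:
    """Return True only for unprotected files inside the agent's own workspace."""
    root = Path(workspace).resolve()
    # Resolving removes ".." segments and symlinks before the containment check.
    target = (root / path).resolve()
    if root not in target.parents:
        return False  # outside the workspace, or the workspace directory itself
    return target.name not in PROTECTED_NAMES
```

Resolving both paths first means a relative path such as `../other-agent/SOUL.md` is rejected rather than silently escaping the workspace.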
agent-autoresearch
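Any agent can run this. The experiment is always: change something → measure it → keep what works.
The Core Idea
Karpathy's insight: give an agent a fixed time budget, let it modify one file, measure whether things got better, keep or discard the change, and repeat.
Applied to agents: your workspace plays the role of train.py. Your SOUL.md, scripts, and skills are the experiment substrate.

    Propose → Implement → Measure → KEEP/KILL → Integrate → Repeat

You are not just optimizing content. You are optimizing the agent itself.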
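What Can Be Mutated
The agent can propose changes to any file it owns:
| Category | Examples |
|---|---|
| Behavior | New response patterns, different tone, new check routines |
| Workflow | New scripts, automations, cron jobs, notification flows |
| Memory | Updated MEMORY.md entries, new daily conventions |
| Identity | Revised SOUL.md directives, new operational rules |
| Skills | New skill installations, skill configurations |
| Quality | New validation logic, error handling patterns |

The agent cannot mutate: safety rules, constitution, security boundaries, or files it doesn't own.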
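Project Structure

    agent-autoresearch/
    ├── SKILL.md          ← you are here
    ├── program.md        ← 🧠 instructions for the experiment agent
    ├── prepare.py        ← establishes baseline metrics
    ├── evolve.py         ← integrates KEEP verdicts into agent files
    ├── analyze.py        ← computes the verdict from measurements
    ├── baseline.json     ← current agent baseline (performance + strategy)
    ├── results.tsv       ← all experiment results (append-only log)
    └── experiments/
        ├── meta.json     ← experiment state (next experiment ID, kill streak)
        ├── active.md     ← only one active experiment at a time
        └── archive/      ← completed experiments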
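
The layout describes `results.tsv` as an append-only log of all experiment results. A minimal sketch of appending to such a log follows. The column names in `FIELDS` and the `append_result` helper are assumptions made for illustration; the real schema of `results.tsv` is not specified here.

```python
import csv
from pathlib import Path
from typing import Dict, Union

# Hypothetical columns: the actual results.tsv schema is not documented.
FIELDS = ["experiment_id", "metric", "baseline", "score", "verdict"]


def append_result(path: Union[str, Path], row: Dict[str, object]) -> None:
    """Append one experiment result; earlier rows are never rewritten."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    # Mode "a" only ever adds to the end of the file.
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Opening in append mode keeps the history intact, so a KILL verdict stays on record even after the change itself has been reverted.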
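🚀 Quick Start

    # 1. Establish a baseline (measure current agent performance)
    python3 prepare.py --metric task_completion_rate --baseline 0.75

    # 2. Read the experiment brief
    cat program.md

    # 3. Start the experiment loop
    # The agent reads program.md, proposes a self-improvement, implements the change,
    # measures the result, and executes the KEEP/KILL verdict.

    # Check current state
    python3 prepare.py --status
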
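Baseline Metrics
Track what matters for the agent's mission. Examples:
| Mission | Metric | How to Measure |
|---|---|---|
| Task completion | `task_completion_rate` | % of assigned tasks completed |
| Response quality | `output_quality_score` | Human rating 1-10 or diff-based |
| Speed | `avg_response_time_s` | Seconds per response |
| Self-improvement | `learnings_logged` | Entries added to MEMORY.md per week |
| Autonomy | `escalations_to_human` | Times a human was unnecessarily interrupted |

Establish a baseline from ≥ 10 measurements before running experiments.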
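Verdict Logic

    improvement = (experiment_score - baseline_score) / baseline_score

    ≥ +10%         → KEEP   (integrate the change into the agent)
    ≤ -10%         → KILL   (discard, revert to the previous state)
    -10% to +10%   → MODIFY (extend the evaluation or treat as KILL)

For quality/rating metrics (higher is better): the thresholds above apply as written.
For cost/latency metrics (lower is better): flip the sign of the improvement before applying the thresholds.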
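Key Rules
- ❌ Multiple mutations at once: test one change per experiment
- ❌ No baseline: collect ≥ 10 measurements before experimenting
- ❌ Vibes verdicts: decide from actual measurements
- ❌ Mutating safety/constitution files: never
- ❌ Ignoring a kill streak: after ≥ 3 consecutive KILLs, pause and wait for human review
- ❌ Infinite MODIFY: at most one extension per experiment
- ❌ Reverting a KEEP: only a newer KEEP overrides it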
Commands
| Command | What it does |
|---|---|
| `python3 prepare.py --status` | Check current state |
| `python3 prepare.py --metric X --baseline Y` | Establish baseline |
| `python3 analyze.py experiments/active.md --auto` | Compute verdict |
| `python3 evolve.py experiments/active.md` | Execute KEEP verdict |
| `python3 evolve.py experiments/active.md --kill` | Execute KILL verdict |
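Security
- Agents can only mutate files within their own workspace
- Safety rules and constitution are always excluded from mutation
- External API calls require human approval
- Destructive operations (rm, git reset --hard) require explicit confirmation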