Iterative Code Evolution

A structured methodology for improving code through disciplined reflect → mutate → verify → score cycles, adapted from the ALMA research framework for meta-learning code designs.

When to Use This Skill

- Iterating on code that isn't working well enough (performance, correctness, design)
- Optimizing an implementation across multiple rounds of changes
- Debugging persistent or recurring issues where simple fixes keep failing
- Evolving a system design through structured experimentation
- Any task where you've already tried 2+ approaches and need discipline about what to try next
- Building or improving prompts, pipelines, agents, or any "program" that benefits from iterative refinement

When NOT to Use This Skill

- Simple one-shot code generation (just write it)
- Mechanical tasks with clear solutions (refactoring, formatting, migrations)
- When the user has already specified exactly what to change

Core Concepts

The Evolution Loop

Every improvement cycle follows this sequence:

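┌──────────────────────────────────────────────────────┐
│ 1. ANALYZE: structured diagnosis of the current code │
│ 2. PLAN: prioritized, concrete changes               │
│ 3. MUTATE: implement the changes                     │
│ 4. VERIFY: run the code, check for errors            │
│ 5. SCORE: measure improvement against the baseline   │
│ 6. ARCHIVE: log what was tried and what happened     │
│                                                      │
│ Loop back to step 1 with new knowledge               │
└──────────────────────────────────────────────────────┘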

The Evolution Log

Track all iterations in .evolution/log.json at the project root. This is the memory that makes each cycle smarter than the last.

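{
  "baseline": {
    "description": "Initial implementation before evolution",
    "score": 0.0,
    "timestamp": "2025-01-15T10:00:00Z"
  },
  "variants": {
    "v001": {
      "parent": "baseline",
      "description": "Added input validation and error handling",
      "changes_made": [
        {
          "what": "Added type checks on all public methods",
          "why": "3/10 test cases crashed at runtime on malformed input",
          "priority": "High"
        }
      ],
      "score": 0.6,
      "delta": "+0.6 vs parent",
      "timestamp": "2025-01-15T10:30:00Z",
      "learned": "Input validation was the main failure mode; most other logic is sound"
    },
    "v002": {
      "parent": "v001",
      "description": "Refactored parsing logic to handle edge cases",
      "changes_made": [
        {
          "what": "Rewrote parse_input() to use a state machine instead of regex",
          "why": "Regex approach failed on nested structures (seen in test cases 7, 8)",
          "priority": "High"
        }
      ],
      "score": 0.85,
      "delta": "+0.25 vs parent",
      "timestamp": "2025-01-15T11:00:00Z",
      "learned": "State machine approach generalizes better than regex for this grammar"
    }
  },
  "principles_learned": [
    "Input validation fixes give the biggest early gains",
    "Regex-based parsing breaks on recursive structures; prefer state machines",
    "Small targeted changes score better than large rewrites"
  ]
}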

The Process in Detail

Phase 1: ANALYZE — Structured Diagnosis

Before changing anything, perform a structured analysis of the current code and its outputs. This is the most important phase — it prevents wasted mutations.

Step 1 — Learn from past edits (skip on first iteration)

Review the evolution log. For each previous change:

- Did the score improve or degrade?
- What pattern made it succeed or fail?
- Extract 2-3 principles to adopt and 2-3 pitfalls to avoid
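The review in Step 1 can be partly mechanized. The sketch below assumes a log dictionary with a `baseline` entry and a `variants` mapping whose entries carry `parent`, `score`, and `learned` fields; the function name `review_past_edits` is hypothetical.

```python
def review_past_edits(log):
    """Split variants into those that improved on their parent and those that did not."""
    def score_of(name):
        # The baseline lives outside the variants mapping (assumed layout).
        entry = log["baseline"] if name == "baseline" else log["variants"][name]
        return entry["score"]

    improved, regressed = [], []
    for name, entry in log["variants"].items():
        delta = entry["score"] - score_of(entry["parent"])
        bucket = improved if delta > 0 else regressed
        bucket.append((name, round(delta, 4), entry.get("learned", "")))
    return improved, regressed
```

Reading the `learned` notes of each bucket side by side is one way to draw out the principles to adopt and the pitfalls to avoid.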

Step 2 — Component-level assessment

For each meaningful component (function, class, module, pipeline stage), label it:

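Label	Meaning
Working	Produces correct output, no issues observed
Fragile	Works on the happy path but fails on edge cases or specific inputs
Broken	Produces wrong output or errors
Redundant	Duplicates logic that exists elsewhere, adding complexity without value
Missing	A necessary component that does not exist yet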

For each label, write a one-line explanation of why — linked to specific test outputs or observed behavior.

Step 3 — Quality and coherence check

Look for cross-cutting issues:

- Data flow: Do components pass structured data to each other, or rely on implicit state?
- Error handling: Are errors caught and handled, or silently swallowed?
- Duplication: Is the same logic repeated in multiple places?
- Hardcoding: Are there magic numbers, hardcoded paths, or environment-specific assumptions?
- Generalization: Which parts would work on new inputs vs. which are overfitted to test cases?

Step 4 — Produce prioritized suggestions

Based on Steps 1-3, produce concrete changes. Each suggestion must have:

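- Priority: High | Medium | Low
- What: Precise description of the change (code-level, not vague)
- Why: Link to a specific observation from Steps 1-3
- Risk: What could go wrong if this change is implemented badly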

Rule: Every suggestion must link to an observation. No "this might help" suggestions — only changes grounded in something you actually saw in the code or outputs.

Rule: Limit to 3 suggestions per cycle. Making more than 3 changes at once makes it impossible to attribute an improvement or regression to a specific change.
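Both rules can be enforced in code before a plan is accepted. This is a minimal sketch, assuming each suggestion carries `priority`, `what`, `why`, and `risk` fields; the `Suggestion` class and `validate_suggestions` function are hypothetical names.

```python
from dataclasses import dataclass

PRIORITIES = ("High", "Medium", "Low")
MAX_SUGGESTIONS_PER_CYCLE = 3


@dataclass(frozen=True)
class Suggestion:
    priority: str  # "High", "Medium", or "Low"
    what: str      # code-level description of the change
    why: str       # the observation that motivates the change
    risk: str      # what could go wrong if it is implemented badly


def validate_suggestions(suggestions):
    """Check both rules, then return the suggestions sorted by priority."""
    if len(suggestions) > MAX_SUGGESTIONS_PER_CYCLE:
        raise ValueError("at most 3 suggestions per cycle")
    for s in suggestions:
        if s.priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {s.priority!r}")
        if not s.why.strip():
            raise ValueError("every suggestion must link to an observation")
    return sorted(suggestions, key=lambda s: PRIORITIES.index(s.priority))
```

A non-empty `why` cannot prove the observation is real, so this check only catches suggestions that skip the link entirely.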

Phase 2: PLAN — Select What to Change

Pick 1-3 suggestions from the analysis. Selection principles:

- High priority first — fix broken things before optimizing working things
- One theme per cycle — don't mix unrelated changes (e.g., don't fix parsing AND refactor error handling in the same mutation)
- Prefer targeted over sweeping — a surgical change to one function beats a rewrite of three modules
- If stuck, explore — if the last 2+ cycles showed diminishing returns on the same component, pick a different component to modify (this is the ALMA "visit penalty" principle — don't keep grinding on the same thing)

Phase 3: MUTATE — Implement Changes

Write the new code. Key discipline:

- Change only what the plan says. Resist the urge to "fix one more thing" while you're in there.
- Preserve interfaces. Don't change function signatures or return types unless the plan explicitly calls for it.
- Comment the rationale. Add a brief comment near each change referencing the evolution cycle (e.g., # evo-v003: switched to state machine per edge case failures)

Phase 4: VERIFY — Run and Check

Execute the modified code against the same inputs/tests used for scoring.

If it crashes (up to 3 retries):

Use the reflection-fix protocol:

1. Read the full error traceback
2. Identify the root cause (not the symptom)
3. Fix only the root cause — do not make unrelated improvements
4. Re-run

After 3 failed retries, revert to parent variant and log the failure:
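{
  "attempted": "Description of what was tried",
  "failure_mode": "The error that couldn't be resolved",
  "learned": "Why this approach doesn't work"
}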

This failure data is valuable — it prevents re-attempting the same broken approach.
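The retry-then-revert flow might be sketched as follows. It assumes the caller supplies a `run` callable that raises on a crash and a `fix` callable that applies a root-cause fix; `verify_with_retries` is a hypothetical name, and the actual revert to the parent variant is left to the caller.

```python
def verify_with_retries(run, fix, attempted, max_retries=3):
    """Run the code; on a crash, apply a fix and re-run, up to max_retries times.

    Returns (True, None) on success, or (False, failure_record) when every
    retry has failed and the caller should revert to the parent variant.
    """
    last_error = ""
    for attempt in range(max_retries + 1):  # one initial run plus the retries
        try:
            run()
            return True, None
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < max_retries:
                fix(exc)  # fix only the root cause of this error
    failure_record = {
        "attempted": attempted,
        "failure_mode": last_error,
        "learned": "",  # to be filled in once the failure has been analyzed
    }
    return False, failure_record
```

Appending the returned record to the evolution log keeps the failed approach visible to later cycles.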

If it runs but produces wrong output:

Don't immediately retry. Go back to Phase 1 (ANALYZE) with the new outputs. The wrong output is diagnostic data.

Phase 5: SCORE — Measure Improvement

Compare the new variant's performance against its parent (not just the baseline). Scoring depends on context:

Context	Score Method
Tests exist	Pass rate: tests_passed / total_tests
Performance optimization

Always compute delta vs. parent. This is how you learn which changes help vs. hurt.
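For the test-based case, the arithmetic is small enough to show directly. The function names `pass_rate` and `delta_vs_parent` are hypothetical, and rounding to four decimals is an assumption made only to keep logged values readable.

```python
def pass_rate(tests_passed, total_tests):
    """Fraction of tests that passed, in the range 0-1."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return tests_passed / total_tests


def delta_vs_parent(score, parent_score):
    """Marginal change relative to the parent variant, not the baseline."""
    return round(score - parent_score, 4)


# A variant passing 18 of 25 tests, whose parent scored 0.40:
score = pass_rate(18, 25)
print(score, delta_vs_parent(score, 0.40))
```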

Phase 6: ARCHIVE — Log and Learn

Update .evolution/log.json:

1. Record the new variant with parent, description, changes, score, delta
2. Write a learned field: one sentence about what this cycle taught you
3. If the score improved, add the underlying principle to principles_learned
4. If the score degraded, add the failure mode to principles_learned as a pitfall
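A minimal sketch of the archive step in Python. The helper names `load_log`, `record_variant`, and `save_log` are hypothetical, and storing `delta` as a number rather than a formatted string is an assumption made for easier comparison.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_log(path):
    """Return the evolution log, or an empty skeleton if no log exists yet."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"baseline": {}, "variants": {}, "principles_learned": []}


def record_variant(log, name, parent, description, score, learned, changes=()):
    """Add one variant entry, computing its delta against the parent's score."""
    parent_entry = log["baseline"] if parent == "baseline" else log["variants"][parent]
    log["variants"][name] = {
        "parent": parent,
        "description": description,
        "changes_made": list(changes),
        "score": score,
        "delta": round(score - parent_entry.get("score", 0.0), 4),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "learned": learned,
    }
    return log["variants"][name]


def save_log(log, path):
    """Write the log as JSON, creating the .evolution directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(log, indent=2, ensure_ascii=False), encoding="utf-8")
```

Appending to `log["principles_learned"]` is left as a manual step here, since deciding what the principle is requires judgment rather than arithmetic.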

Variant Management

When to Branch vs. Modify

- Modify in place (same file, new version): When the change is clearly incremental (fixing a bug, adding a check, tuning a parameter)
- Branch (copy to a new file): When trying a fundamentally different approach (different algorithm, different architecture, different strategy)

Keep branches in .evolution/variants/ with descriptive names. The evolution log tracks which is active.

Selection: Which Variant to Iterate On

If you have multiple variants, pick the next one to improve using:

CODEBLOCK4

Where:

- normalized_reward = variant score relative to baseline (0-1 range)
- INLINECODE8 = how many times this variant has been selected for iteration

This balances exploitation (iterating on the best variant) with exploration (trying variants that haven't been touched recently). It prevents getting stuck in local optima.
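One way to realize this trade-off is sketched below. The scoring rule used here (reward minus a logarithmic visit penalty, with a hypothetical `penalty` weight of 0.5) is an assumption chosen for illustration, not the framework's definition, and `visit_count`, `selection_score`, and `pick_variant` are hypothetical names.

```python
import math


def selection_score(normalized_reward, visit_count, penalty=0.5):
    """Higher reward raises the score; repeated visits lower it (assumed form)."""
    return normalized_reward - penalty * math.log1p(visit_count)


def pick_variant(variants, penalty=0.5):
    """variants maps a variant name to (normalized_reward, visit_count)."""
    return max(
        variants,
        key=lambda name: selection_score(*variants[name], penalty=penalty),
    )


variants = {"v001": (0.6, 0), "v002": (0.85, 3)}
print(pick_variant(variants))             # the less-visited variant wins here
print(pick_variant(variants, penalty=0))  # with no penalty, the best score wins
```

Raising `penalty` pushes selection toward exploration, and setting it to zero reduces the rule to pure exploitation.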

Quick Reference: Analysis Template

When performing Phase 1, structure your thinking as:

CODEBLOCK5

Example: Full Evolution Cycle

Context: User asks to improve a web scraper that's failing on 40% of target pages.

Cycle 1 — Analysis:

- Component assessment: parse_html() is Broken (crashes on pages with no <article> tag), fetch_page() is Working, extract_links() is Fragile (misses relative URLs)
- Cross-cutting: No error handling — one bad page kills the entire batch
- Past edits: None (first cycle)
- Plan: [High] Add fallback selectors in parse_html() for pages without <article>

Cycle 1 — Mutate: Add cascading selector logic: try <article>, fall back to <main>, fall back to <body>.
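The cascading selector logic could look roughly like this. The scraper's real code is not shown, so this version of `parse_html()` is a hypothetical stand-in built on the standard library's `html.parser`, and the helper class `_TagText` is an assumption.

```python
from html.parser import HTMLParser


class _TagText(HTMLParser):
    """Collect the text inside the first occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag, self.depth, self.found, self.parts = tag, 0, False, []

    def handle_starttag(self, tag, attrs):
        # Enter the first matching tag, and track nested tags of the same name.
        if tag == self.tag and (self.depth > 0 or not self.found):
            self.depth += 1
            self.found = True

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def parse_html(html, selectors=("article", "main", "body")):
    """Try each tag in order and return (tag_used, text) for the first with text."""
    for tag in selectors:
        parser = _TagText(tag)
        parser.feed(html)
        text = " ".join("".join(parser.parts).split())
        if parser.found and text:
            return tag, text
    return None, ""  # no selector matched; the caller decides how to handle it
```

Returning the tag that matched makes it easy to log which fallback fired, which is useful evidence for the next ANALYZE phase.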

Cycle 1 — Verify: Runs without crashes.

Cycle 1 — Score: Pass rate 40% → 72%. Delta: +32%.

Cycle 1 — Archive: Learned: "Most failures were selector misses, not logic errors. Fallback chains are high-value."

Cycle 2 — Analysis:

- Lessons: Fallback selectors gave +32%. Principle: handle structural variation before fixing logic.
- Component assessment: parse_html() now Working. extract_links() still Fragile — relative URLs not resolved.
- Plan: [High] Resolve relative URLs using urljoin in extract_links()

Cycle 2 — Mutate: Add base URL resolution.
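Base URL resolution with `urllib.parse.urljoin` might be sketched as below. The helper `resolve_links` is a hypothetical name, and skipping empty and fragment-only links is an assumption about what the scraper wants.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Resolve each href against base_url, dropping empty and fragment-only links."""
    resolved = []
    for href in hrefs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


links = resolve_links(
    "https://example.com/blog/post.html",
    ["/about", "next.html", "https://other.org/x", "#top", "../img/a.png"],
)
print(links)
```

Absolute URLs pass through `urljoin` unchanged, so the same call handles both relative and absolute links.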

Cycle 2 — Score: 72% → 88%. Delta: +16%.

Cycle 2 — Archive: Learned: "URL resolution was second-biggest failure mode. Always normalize URLs at extraction time."

Key Principles

- Every change must link to an observation — no speculative fixes
- Max 3 changes per cycle — attribute improvements accurately
- Log everything — failed attempts are as valuable as successes
- Score against parent, not just baseline — track marginal improvement
- Explore when stuck — if 2+ cycles on the same component show diminishing returns, move to a different component
- Revert on 3 failed retries — don't spiral; log the failure and try a different approach
- Principles compound — the evolution log's principles_learned list is the most valuable artifact; it encodes what works for this specific codebase

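Iterative Code Evolution

A structured methodology for improving code through disciplined reflect → mutate → verify → score cycles, adapted from the ALMA research framework for meta-learning code designs.

When to Use This Skill

- Iterating on code that isn't working well enough (performance, correctness, design)
- Optimizing an implementation across multiple rounds of changes
- Debugging persistent or recurring issues where simple fixes keep failing
- Evolving a system design through structured experimentation
- Any task where you've already tried 2+ approaches and need discipline about what to try next
- Building or improving prompts, pipelines, agents, or any program that benefits from iterative refinement

When NOT to Use This Skill

- Simple one-shot code generation (just write it)
- Mechanical tasks with clear solutions (refactoring, formatting, migrations)
- When the user has already specified exactly what to change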

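Core Concepts

The Evolution Loop

Every improvement cycle follows this sequence:

┌──────────────────────────────────────────────────────┐
│ 1. ANALYZE: structured diagnosis of the current code │
│ 2. PLAN: prioritized, concrete changes               │
│ 3. MUTATE: implement the changes                     │
│ 4. VERIFY: run the code, check for errors            │
│ 5. SCORE: measure improvement against the baseline   │
│ 6. ARCHIVE: log what was tried and what happened     │
│                                                      │
│ Loop back to step 1 with new knowledge               │
└──────────────────────────────────────────────────────┘

The Evolution Log

Track all iterations in .evolution/log.json at the project root. This is the memory that makes each cycle smarter than the last.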

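{
  "baseline": {
    "description": "Initial implementation before evolution",
    "score": 0.0,
    "timestamp": "2025-01-15T10:00:00Z"
  },
  "variants": {
    "v001": {
      "parent": "baseline",
      "description": "Added input validation and error handling",
      "changes_made": [
        {
          "what": "Added type checks on all public methods",
          "why": "3/10 test cases crashed at runtime on malformed input",
          "priority": "High"
        }
      ],
      "score": 0.6,
      "delta": "+0.6 vs parent",
      "timestamp": "2025-01-15T10:30:00Z",
      "learned": "Input validation was the main failure mode; most other logic is sound"
    },
    "v002": {
      "parent": "v001",
      "description": "Refactored parsing logic to handle edge cases",
      "changes_made": [
        {
          "what": "Rewrote parse_input() to use a state machine instead of regex",
          "why": "Regex approach failed on nested structures (seen in test cases 7, 8)",
          "priority": "High"
        }
      ],
      "score": 0.85,
      "delta": "+0.25 vs parent",
      "timestamp": "2025-01-15T11:00:00Z",
      "learned": "State machine approach generalizes better than regex for this grammar"
    }
  },
  "principles_learned": [
    "Input validation fixes give the biggest early gains",
    "Regex-based parsing breaks on recursive structures; prefer state machines",
    "Small targeted changes score better than large rewrites"
  ]
}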

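The Process in Detail

Phase 1: ANALYZE (Structured Diagnosis)

Before changing anything, perform a structured analysis of the current code and its outputs. This is the most important phase because it prevents wasted mutations.

Step 1: Learn from past edits (skip on first iteration)

Review the evolution log. For each previous change:

- Did the score improve or degrade?
- What pattern made it succeed or fail?
- Extract 2-3 principles to adopt and 2-3 pitfalls to avoid

Step 2: Component-level assessment

For each meaningful component (function, class, module, pipeline stage), label it:

Label	Meaning
Working	Produces correct output, no issues observed
Fragile	Works on the happy path but fails on edge cases or specific inputs
Broken	Produces wrong output or errors
Redundant	Duplicates logic that exists elsewhere, adding complexity without value
Missing	A necessary component that does not exist yet

For each label, write a one-line explanation of why, linked to specific test outputs or observed behavior.

Step 3: Quality and coherence check

Look for cross-cutting issues:

- Data flow: Do components pass structured data to each other, or rely on implicit state?
- Error handling: Are errors caught and handled, or silently swallowed?
- Duplication: Is the same logic repeated in multiple places?
- Hardcoding: Are there magic numbers, hardcoded paths, or environment-specific assumptions?
- Generalization: Which parts would work on new inputs vs. which are overfitted to test cases?

Step 4: Produce prioritized suggestions

Based on Steps 1-3, produce concrete changes. Each suggestion must have:

- Priority: High | Medium | Low
- What: Precise description of the change (code-level, not vague)
- Why: Link to a specific observation from Steps 1-3
- Risk: What could go wrong if this change is implemented badly

Rule: Every suggestion must link to an observation. No "this might help" suggestions; only changes grounded in something you actually saw in the code or outputs.

Rule: Limit to 3 suggestions per cycle. Making more than 3 changes at once makes it impossible to attribute an improvement or regression to a specific change.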

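Phase 2: PLAN (Select What to Change)

Pick 1-3 suggestions from the analysis. Selection principles:

- High priority first: fix broken things before optimizing working things
- One theme per cycle: don't mix unrelated changes (e.g., don't fix parsing AND refactor error handling in the same mutation)
- Prefer targeted over sweeping: a surgical change to one function beats a rewrite of three modules
- If stuck, explore: if the last 2+ cycles showed diminishing returns on the same component, pick a different component to modify (this is the ALMA "visit penalty" principle: don't keep grinding on the same thing)

Phase 3: MUTATE (Implement Changes)

Write the new code. Key discipline:

- Change only what the plan says. Resist the urge to "fix one more thing" while you're in there.
- Preserve interfaces. Don't change function signatures or return types unless the plan explicitly calls for it.
- Comment the rationale. Add a brief comment near each change referencing the evolution cycle (e.g., # evo-v003: switched to state machine per edge case failures)

Phase 4: VERIFY (Run and Check)

Execute the modified code against the same inputs/tests used for scoring.

If it crashes (up to 3 retries):

Use the reflection-fix protocol:

1. Read the full error traceback
2. Identify the root cause (not the symptom)
3. Fix only the root cause; do not make unrelated improvements
4. Re-run

After 3 failed retries, revert to parent variant and log the failure:

{
  "attempted": "Description of what was tried",
  "failure_mode": "The error that couldn't be resolved",
  "learned": "Why this approach doesn't work"
}

This failure data is valuable because it prevents re-attempting the same broken approach.

If it runs but produces wrong output:

Don't immediately retry. Go back to Phase 1 (ANALYZE) with the new outputs. The wrong output is diagnostic data.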

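Phase 5: SCORE (Measure Improvement)

Compare the new variant's performance against its parent (not just the baseline). Scoring depends on context:

Context	Score Method
Tests exist	Pass rate: tests_passed / total_tests
Performance optimization

Always compute delta vs. parent. This is how you learn which changes help vs. hurt.

Phase 6: ARCHIVE (Log and Learn)

Update .evolution/log.json:

1. Record the new variant with parent, description, changes, score, delta
2. Write a learned field: one sentence about what this cycle taught you
3. If the score improved, add the underlying principle to principles_learned
4. If the score degraded, add the failure mode to principles_learned as a pitfall

Variant Management

When to Branch vs. Modify

- Modify in place (same file, new version): When the change is clearly incremental (fixing a bug, adding a check, tuning a parameter)
- Branch (copy to a new file): When trying a fundamentally different approach (different algorithm, different architecture, different strategy)

Keep branches in .evolution/variants/ with descriptive names.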
