Sprint Contract — Multi-Agent Quality System

Based on Anthropic's harness design for long-running apps: separate the agent doing the work from the agent judging it.

Core Principle

Never let the builder evaluate their own work on complex tasks. LLMs reliably praise their own output — even when it's mediocre. An independent evaluator, tuned to be skeptical, catches what self-evaluation misses.

Architecture

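    Planner (you / human) → Generator (sub-agent) → Evaluator (independent sub-agent)
                                 ↑                               |
                                 └──────── feedback loop ←───────┘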
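As a rough illustration of keeping the builder and the judge apart, the sketch below wires two injected callables into a feedback loop. Everything here is hypothetical scaffolding: `EvalReport`, `run_sprint`, and the `max_rounds` cap are assumptions made for the example, not part of the original design.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvalReport:
    """Hypothetical evaluator output: failed criteria plus an escalation flag."""
    failed: List[str] = field(default_factory=list)
    needs_human: bool = False


def run_sprint(brief: str,
               generate: Callable[[str, str], str],
               evaluate: Callable[[str, str], EvalReport],
               max_rounds: int = 3) -> str:
    """Alternate generator and evaluator until the work ships or is escalated."""
    feedback = ""
    for _ in range(max_rounds):
        handoff = generate(brief, feedback)   # builder works from the brief and prior feedback
        report = evaluate(brief, handoff)     # a separate agent judges the result
        if report.needs_human:
            return "escalate"
        if not report.failed:
            return "ship"
        feedback = "\n".join(report.failed)   # failed criteria go back to the builder
    return "escalate"                         # assumed policy: stop looping after max_rounds
```

In practice `generate` and `evaluate` would each spawn their own sub-agent; passing them in as plain functions keeps the loop testable without any agent runtime.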

Workflow

1. Write a BRIEF.md with Sprint Contract

Every task gets a BRIEF.md. The Sprint Contract section is mandatory — it lists specific, testable completion criteria.

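    Task Brief

    Background
    [Why this task exists]

    Goal
    [What to build or fix]

    Sprint Contract (completion criteria)
    - [ ] Criterion 1 (specific, verifiable)
    - [ ] Criterion 2
    - [ ] ...

    ⚠️ Write criteria specific to this task. Do not use a generic checklist.

    Constraints
    [Tech stack, prior decisions, known pitfalls]

    Handoff requirements
    When done, write HANDOFF.md containing:
    - Work completed (list of changed files)
    - Design decisions made (and why)
    - Open issues / known problems
    - Everything needed to report to the human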
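Because the Sprint Contract is mandatory, it can help to check a brief mechanically before spawning anything. The sketch below is one possible check, assuming the criteria are written as Markdown checkbox items (`- [ ] ...`) under a heading that starts with "Sprint Contract" and that the next `#` heading ends the section. The function names are hypothetical.

```python
import re
from typing import List

# Matches "- [ ] text" and "- [x] text" checkbox items.
CHECKBOX = re.compile(r"^\s*-\s*\[( |x|X)\]\s+(.*\S)\s*$")


def extract_criteria(brief_text: str) -> List[str]:
    """Return the checkbox items listed under the Sprint Contract heading."""
    criteria: List[str] = []
    in_contract = False
    for line in brief_text.splitlines():
        stripped = line.strip()
        if stripped.lstrip("#").strip().startswith("Sprint Contract"):
            in_contract = True
            continue
        if in_contract:
            match = CHECKBOX.match(line)
            if match:
                criteria.append(match.group(2))
            elif stripped.startswith("#"):
                break  # assumed convention: the next heading closes the section
    return criteria


def validate_brief(brief_text: str) -> List[str]:
    """Reject a brief that has no testable criteria."""
    criteria = extract_criteria(brief_text)
    if not criteria:
        raise ValueError("BRIEF.md has no Sprint Contract criteria")
    return criteria
```

A brief that fails `validate_brief` is a signal to go back and write task-specific criteria before any generator is started.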

2. Spawn Generator (Builder)

The generator receives the BRIEF.md and builds against the Sprint Contract. Key rules for the generator prompt:

- Work against the Sprint Contract criteria
- Self-check each criterion before handing off
- Write HANDOFF.md when done
- Write files first, read references second (output > research)

3. Spawn Evaluator (Independent QA)

After the generator finishes, spawn a separate agent as evaluator. The evaluator prompt must include:

The Sprint Contract — copied from BRIEF.md, to verify each criterion.

4 Evaluation Dimensions (select what's relevant):

Dimension	What to check
Functional completeness	Every Sprint Contract criterion passes
User experience
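The required pieces of the evaluator prompt (the copied contract and the selected dimensions) can be assembled mechanically. The following is a minimal sketch under stated assumptions: `build_evaluator_prompt`, its section labels, and the closing reporting instruction are hypothetical, and the skeptical stance is passed in as `stance` rather than hard-coded.

```python
from typing import Sequence


def build_evaluator_prompt(criteria: Sequence[str],
                           dimensions: Sequence[str],
                           stance: str) -> str:
    """Assemble an evaluator prompt from the copied contract and chosen dimensions."""
    if not criteria:
        raise ValueError("evaluator prompt needs at least one Sprint Contract criterion")
    lines = [stance.strip(), "", "Sprint Contract (verify every criterion):"]
    lines += [f"{i}. {c}" for i, c in enumerate(criteria, start=1)]
    if dimensions:
        lines += ["", "Evaluation dimensions:"]
        lines += [f"- {d}" for d in dimensions]
    # Assumed reporting format; adapt to whatever the decision step expects.
    lines += ["", "For each criterion, report PASS or FAIL with the evidence you checked."]
    return "\n".join(lines)
```

Numbering the criteria gives the evaluator's report a stable way to refer back to each item when results are fed to the generator.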

The critical prompt line:

"Your job is to find problems, not to praise. If everything looks fine, you probably didn't test carefully enough. Report issues honestly — better a false alarm than a missed bug."

4. Decision Gate

Based on evaluator feedback:

- All criteria pass → Ship it
- Criteria fail → Feed evaluator report back to generator for fixes
- Architecture issues → Escalate to human
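The three outcomes above can be written as a small pure function. This is a sketch only: the `Action` enum, the shape of the evaluator results (a criterion-to-boolean mapping plus a list of architecture issues), and the rule that architecture issues take precedence over everything else are assumptions for the example.

```python
from enum import Enum
from typing import Dict, List


class Action(Enum):
    SHIP = "ship"
    FIX = "fix"            # feed the evaluator report back to the generator
    ESCALATE = "escalate"  # hand the decision to a human


def decide(results: Dict[str, bool], architecture_issues: List[str]) -> Action:
    """Map an evaluator report to one of the three gate outcomes."""
    if not results:
        raise ValueError("no criteria were evaluated")
    if architecture_issues:
        return Action.ESCALATE  # assumed precedence: structural problems beat pass/fail
    if all(results.values()):
        return Action.SHIP
    return Action.FIX
```

Refusing an empty report prevents a run in which nothing was evaluated from being shipped by accident.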

When to Use Each Mode

Task complexity	Generator	Evaluator	Example
Simple (< 30 min)	Sub-agent	Self-evaluate, mark "⚠️ untested"	Fix a typo, update config
Medium (30 min - 2 hr)

Sprint Contract Examples

See references/contract-examples.md for project-specific contract templates.

Key Insights from Anthropic's Research

1. File-based communication: agents talk through files (BRIEF.md, HANDOFF.md), not conversation
2. Evaluator calibration: default LLMs are too lenient; explicitly prompt for skepticism
3. Sprint scoping: one feature at a time; don't bundle unrelated work
4. Opus 4.6 + 1M context: context anxiety is gone; sprint decomposition is less critical, but the evaluator still adds value at task boundaries
5. Evaluation criteria shape output: the wording of your criteria directly steers what the generator produces

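Sprint Contract: Multi-Agent Quality System

Based on Anthropic's harness design for long-running apps: separate the agent doing the work from the agent judging it.

Core Principle

Never let the builder evaluate their own work on complex tasks. LLMs reliably praise their own output, even when it is mediocre. An independent evaluator, tuned to be skeptical, catches what self-evaluation misses.

Architecture

    Planner (you / human) → Generator (sub-agent) → Evaluator (independent sub-agent)
                                 ↑                               |
                                 └──────── feedback loop ←───────┘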

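Workflow

1. Write a BRIEF.md with Sprint Contract

Every task gets a BRIEF.md. The Sprint Contract section is mandatory: it lists specific, testable completion criteria.

    Task Brief

    Background
    [Why this task exists]

    Goal
    [What to build or fix]

    Sprint Contract (completion criteria)
    - [ ] Criterion 1 (specific, verifiable)
    - [ ] Criterion 2
    - [ ] ...

    ⚠️ Write criteria specific to this task. Do not use a generic checklist.

    Constraints
    [Tech stack, prior decisions, known pitfalls]

    Handoff requirements
    When done, write HANDOFF.md containing:
    - Work completed (list of changed files)
    - Design decisions made (and why)
    - Open issues / known problems
    - Everything needed to report to the human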

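2. Spawn Generator (Builder)

The generator receives the BRIEF.md and builds against the Sprint Contract. Key rules for the generator prompt:

- Work against the Sprint Contract criteria
- Self-check each criterion before handing off
- Write HANDOFF.md when done
- Write files first, read references second (output > research)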

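3. Spawn Evaluator (Independent QA)

After the generator finishes, spawn a separate agent as evaluator. The evaluator prompt must include:

The Sprint Contract, copied from BRIEF.md, to verify each criterion.

4 Evaluation Dimensions (select what's relevant):

Dimension	What to check
Functional completeness	Every Sprint Contract criterion passes
User experience

The critical prompt line:

"Your job is to find problems, not to praise. If everything looks fine, you probably didn't test carefully enough. Report issues honestly: a false alarm is better than a missed bug."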

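4. Decision Gate

Based on evaluator feedback:

- All criteria pass → Ship it
- Criteria fail → Feed evaluator report back to generator for fixes
- Architecture issues → Escalate to human

When to Use Each Mode

Task complexity	Generator	Evaluator	Example
Simple (< 30 min)	Sub-agent	Self-evaluate, mark "⚠️ untested"	Fix a typo, update config
Medium (30 min - 2 hr)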

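Sprint Contract Examples

See references/contract-examples.md for project-specific contract templates.

Key Insights from Anthropic's Research

1. File-based communication: agents talk through files (BRIEF.md, HANDOFF.md), not conversation
2. Evaluator calibration: default LLMs are too lenient; explicitly prompt for skepticism
3. Sprint scoping: one feature at a time; don't bundle unrelated work
4. Opus 4.6 + 1M context: context anxiety is gone; sprint decomposition is less critical, but the evaluator still adds value at task boundaries
5. Evaluation criteria shape output: the wording of the criteria directly steers what the generator produces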
