Sprint Contract — Multi-Agent Quality System
Based on Anthropic's harness design for long-running apps: separate the agent doing the work from the agent judging it.
Core Principle
Never let the builder evaluate their own work on complex tasks. LLMs reliably praise their own output — even when it's mediocre. An independent evaluator, tuned to be skeptical, catches what self-evaluation misses.
Architecture
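Planner (you/human) → Generator (sub-agent) → Evaluator (independent sub-agent)
       ↑                                                  |
       └──────────── feedback loop ←──────────────────────┘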
Workflow
1. Write a BRIEF.md with Sprint Contract
Every task gets a BRIEF.md. The Sprint Contract section is mandatory — it lists specific, testable completion criteria.
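Task Brief
Background
[Why this task exists]
Goal
[What to build/fix]
Sprint Contract (completion criteria)
- [ ] Criterion 1 (specific, verifiable)
- [ ] Criterion 2
- [ ] ...
⚠️ Write criteria specific to this task. Do not use a generic checklist.
Relevant Files
[File paths relevant to the task]
Constraints
[Tech stack, prior decisions, known pitfalls]
Handoff Requirements
Write HANDOFF.md when done, including:
- What was done (list of file changes)
- Design decisions made (and why)
- Open issues / known problems
- Everything needed to report to the human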
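As a rough illustration, a brief could be generated from structured data so that the Sprint Contract always renders as a checklist and is never left empty. The `Brief` class, its field names, and the `render_brief` helper are hypothetical; only the BRIEF.md file name and the mandatory Sprint Contract section come from this document.

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    """Hypothetical in-memory form of a BRIEF.md (all names are assumptions)."""
    background: str
    goal: str
    criteria: list[str] = field(default_factory=list)


def render_brief(brief: Brief) -> str:
    """Render the brief as text, with one unchecked box per criterion."""
    if not brief.criteria:
        # The Sprint Contract is mandatory, so refuse to render without it.
        raise ValueError("Sprint Contract needs at least one criterion")
    lines = [
        "Task Brief",
        "Background",
        brief.background,
        "Goal",
        brief.goal,
        "Sprint Contract (completion criteria)",
    ]
    lines += [f"- [ ] {criterion}" for criterion in brief.criteria]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = Brief(
        background="Users cannot log in after the last deploy",
        goal="Fix the login flow",
        criteria=["POST /login returns 200 for valid credentials"],
    )
    print(render_brief(example))
```

Raising on an empty criteria list is one way to enforce "mandatory" at the point where the brief is written rather than at review time.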
2. Spawn Generator (Builder)
The generator receives the BRIEF.md and builds against the Sprint Contract. Key rules for the generator prompt:
- Work against the Sprint Contract criteria
- Self-check each criterion before handing off
- Write HANDOFF.md when done
- Write files first, read references second (output > research)
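The generator rules can be kept in one place and prepended to every brief. The sketch below assumes the prompt is a plain string handed to whatever sub-agent mechanism is in use; `GENERATOR_RULES` and `build_generator_prompt` are illustrative names, not part of any tool.

```python
# The four generator rules from this document, kept as data so that
# every spawned generator receives the same wording.
GENERATOR_RULES = [
    "Work against the Sprint Contract criteria.",
    "Self-check each criterion before handing off.",
    "Write HANDOFF.md when done.",
    "Write files first, read references second.",
]


def build_generator_prompt(brief_text: str) -> str:
    """Combine the fixed rules with the contents of BRIEF.md (hypothetical helper)."""
    rules = "\n".join(f"- {rule}" for rule in GENERATOR_RULES)
    return (
        "You are the generator. Build what the brief describes.\n\n"
        f"Rules:\n{rules}\n\n"
        f"BRIEF.md:\n{brief_text.strip()}\n"
    )


if __name__ == "__main__":
    print(build_generator_prompt("Goal: add dark mode\n"))
```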
3. Spawn Evaluator (Independent QA)
After the generator finishes, spawn a separate agent as evaluator. The evaluator prompt must include:
The Sprint Contract — copied from BRIEF.md, to verify each criterion.
4 Evaluation Dimensions (select what's relevant):
| Dimension | What to check |
|---|---|
| Functional completeness | Every Sprint Contract criterion passes |
| User experience | Flow is intuitive, no dead ends |
| Visual quality | Layout, spacing, colors are professional |
| Code/content quality | No errors, clean logic, no regressions |
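One possible way to assemble the evaluator prompt from its required parts is sketched here. The dictionary keys, the `build_evaluator_prompt` function, and the `stance` parameter (the tone-setting instruction given to the evaluator) are assumptions made for illustration; the dimension descriptions are taken from the table.

```python
# Evaluation dimensions from the table; the short keys are hypothetical.
DIMENSIONS = {
    "functional": "Functional completeness: every Sprint Contract criterion passes",
    "ux": "User experience: flow is intuitive, no dead ends",
    "visual": "Visual quality: layout, spacing, colors are professional",
    "quality": "Code/content quality: no errors, clean logic, no regressions",
}


def build_evaluator_prompt(
    criteria: list[str], dimension_keys: list[str], stance: str
) -> str:
    """Build a prompt with the stance first, then the contract, then dimensions."""
    unknown = [key for key in dimension_keys if key not in DIMENSIONS]
    if unknown:
        raise KeyError(f"unknown dimensions: {unknown}")
    parts = [stance, "", "Sprint Contract:"]
    # Number the criteria so the evaluator can report on each one by index.
    parts += [f"{i}. {c}" for i, c in enumerate(criteria, start=1)]
    parts += ["", "Evaluation dimensions:"]
    parts += [f"- {DIMENSIONS[key]}" for key in dimension_keys]
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_evaluator_prompt(
            ["Page loads in under 2 s"],
            ["functional", "quality"],
            "Your job is to find problems, not to praise.",
        )
    )
```

Selecting dimensions by key mirrors the "select what's relevant" instruction: a backend-only task would simply omit `"visual"`.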
The critical prompt line:
"Your job is to find problems, not to praise. If everything looks fine, you probably didn't test carefully enough. Report issues honestly — better a false alarm than a missed bug."
4. Decision Gate
Based on evaluator feedback:
- All criteria pass → Ship it
- Criteria fail → Feed evaluator report back to generator for fixes
- Architecture issues → Escalate to human
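The gate can be written as a small pure function. This is a minimal sketch, assuming the evaluator report has already been parsed into per-criterion pass/fail results; `EvaluatorReport`, `Decision`, and `decide` are hypothetical names. Two choices here are assumptions rather than rules from this document: architecture issues take precedence over everything else, and an empty report is treated as a failure instead of a pass.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    FIX = "fix"            # feed the evaluator report back to the generator
    ESCALATE = "escalate"  # hand the problem to a human


@dataclass
class EvaluatorReport:
    """Hypothetical parsed form of the evaluator's findings."""
    results: dict[str, bool]  # criterion text -> passed?
    architecture_issues: list[str] = field(default_factory=list)


def decide(report: EvaluatorReport) -> Decision:
    if report.architecture_issues:
        # Assumption: structural problems outrank individual criteria.
        return Decision.ESCALATE
    if not report.results:
        # Assumption: nothing verified means nothing is shippable.
        return Decision.FIX
    if all(report.results.values()):
        return Decision.SHIP
    return Decision.FIX


if __name__ == "__main__":
    print(decide(EvaluatorReport({"login works": True, "logout works": False})))
```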
When to Use Each Mode
| Task complexity | Generator | Evaluator | Example |
|---|---|---|---|
| Simple (< 30 min) | Sub-agent | Self-evaluate, mark "⚠️ untested" | Fix a typo, update config |
| Medium (30 min - 2 hr) | Sub-agent | Independent sub-agent | New feature, bug fix |
| Complex (2+ hr) | Claude Code / ACP | Independent sub-agent + human review | Architecture change, new project |
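If task size is estimated in minutes, the table reduces to a threshold lookup. The sketch below is illustrative only: `Mode` and `pick_mode` are hypothetical names, and because the table lists 2 hours in both the medium and complex rows, assigning exactly 120 minutes to the complex row is an assumption.

```python
from typing import NamedTuple


class Mode(NamedTuple):
    generator: str
    evaluator: str


def pick_mode(estimated_minutes: float) -> Mode:
    """Map an effort estimate onto the generator/evaluator pairing."""
    if estimated_minutes < 0:
        raise ValueError("estimate must be non-negative")
    if estimated_minutes < 30:
        return Mode("sub-agent", "self-evaluate, mark untested")
    if estimated_minutes < 120:
        return Mode("sub-agent", "independent sub-agent")
    # Assumption: the 2-hour boundary itself counts as complex.
    return Mode("Claude Code / ACP", "independent sub-agent + human review")


if __name__ == "__main__":
    for minutes in (10, 45, 240):
        print(minutes, pick_mode(minutes))
```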
Sprint Contract Examples
See references/contract-examples.md for project-specific contract templates.
Key Insights from Anthropic's Research
1. File-based communication: agents talk through files (BRIEF.md, HANDOFF.md), not conversation
2. Evaluator calibration: default LLMs are too lenient; explicitly prompt for skepticism
3. Sprint scoping: one feature at a time; don't bundle unrelated work
4. Opus 4.6 + 1M context: context anxiety is gone; sprint decomposition is less critical, but the evaluator still adds value at task boundaries
5. Evaluation criteria shape output: the wording of your criteria directly steers what the generator produces
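Sprint Contract: Multi-Agent Quality System
Based on Anthropic's harness design for long-running applications: separate the agent that does the work from the agent that judges it.
Core Principle
Never let the builder evaluate its own work on complex tasks. Large language models reliably praise their own output, even when that output is mediocre. An independent evaluator, tuned to be skeptical, catches the problems that self-evaluation misses.
Architecture
Planner (you/human) → Generator (sub-agent) → Evaluator (independent sub-agent)
       ↑                                                  |
       └──────────── feedback loop ←──────────────────────┘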
Workflow
1. Write a BRIEF.md with a Sprint Contract
Every task gets a BRIEF.md. The Sprint Contract section is mandatory: it lists specific, testable completion criteria.
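Task Brief
Background
[Why this task exists]
Goal
[What to build/fix]
Sprint Contract (completion criteria)
- [ ] Criterion 1 (specific, verifiable)
- [ ] Criterion 2
- [ ] ...
⚠️ Write criteria specific to this task. Do not use a generic checklist.
Relevant Files
[File paths relevant to the task]
Constraints
[Tech stack, prior decisions, known pitfalls]
Handoff Requirements
Write HANDOFF.md when done, including:
- What was done (list of file changes)
- Design decisions made (and why)
- Open issues / known problems
- Everything needed to report to the human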
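2. Spawn the Generator (Builder)
The generator receives the BRIEF.md and builds against the Sprint Contract. Key rules for the generator prompt:
- Work against the Sprint Contract criteria
- Self-check each criterion before handing off
- Write HANDOFF.md when done
- Write files first, read references second (output > research)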
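3. Spawn the Evaluator (Independent QA)
After the generator finishes, spawn a separate agent as the evaluator. The evaluator prompt must include:
The Sprint Contract, copied from BRIEF.md, so that each criterion can be verified.
4 evaluation dimensions (select the relevant ones):
| Dimension | What to check |
|---|---|
| Functional completeness | Every Sprint Contract criterion passes |
| User experience | Flow is intuitive, no dead ends |
| Visual quality | Layout, spacing, colors are professional |
| Code/content quality | No errors, clear logic, no regressions |
The critical prompt line:
Your job is to find problems, not to praise. If everything looks fine, you probably didn't test carefully enough. Report issues honestly: a false alarm is better than a missed bug.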
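4. Decision Gate
Based on evaluator feedback:
- All criteria pass → Ship it
- Criteria fail → Feed the evaluator report back to the generator for fixes
- Architecture issues → Escalate to a human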
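When to Use Each Mode
| Task complexity | Generator | Evaluator | Example |
|---|---|---|---|
| Simple (< 30 min) | Sub-agent | Self-evaluate, mark "⚠️ untested" | Fix a typo, update config |
| Medium (30 min - 2 hr) | Sub-agent | Independent sub-agent | New feature, bug fix |
| Complex (2+ hr) | Claude Code / ACP | Independent sub-agent + human review | Architecture change, new project |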
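Sprint Contract Examples
See references/contract-examples.md for project-specific contract templates.
Key Insights from Anthropic's Research
1. File-based communication: agents talk through files (BRIEF.md, HANDOFF.md), not conversation
2. Evaluator calibration: default LLMs are too lenient; explicitly prompt for skepticism
3. Sprint scoping: one feature at a time; don't bundle unrelated work
4. Opus 4.6 + 1M context: context anxiety is gone; sprint decomposition is less critical, but the evaluator still adds value at task boundaries
5. Evaluation criteria shape output: the wording of the criteria directly steers what the generator produces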