VibeCoding Pro
The AI coding upgrade that actually ships working software.
VibeCoding is fun. VibeCoding Pro is reliable.
What VibeCoding Gets Wrong
Most AI coding workflows look like this:
You → "Build a login form" → AI generates → "Looks good!" → Ship
                                                  ↑
                                          That's the problem.
Why it's broken: The same AI that generated the code judges whether it works. It suffers from cognitive commitment bias — it can't objectively evaluate what it just built because it already committed to the approach. Bugs survive. Edge cases break. UX issues ship.
The evidence: Anthropic's 2026 engineering research describes an experiment in which solo Claude agents produced 2D game makers whose core game loop was fundamentally broken: entities rendered but ignored all player input. The agent still called its own output "working." Only when a separate Evaluator agent clicked through the running game did it find that the wiring between entity definitions and the game runtime was severed.
What VibeCoding Pro Gets Right
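User Goal / SPEC
       ↓
┌─────────────┐
│  Generator  │ ← builds X from SPEC
│   (vibe)    │
└──────┬──────┘
       │ artifact
       ↓
┌──────────────────────────────────────┐
│              Evaluator               │
│ • Reads SPEC (NOT generator output)  │
│ • Opens URL in a real browser        │
│ • Clicks, fills, navigates           │
│ • Scores against rubric (0-100)      │
│ • Returns structured JSON feedback   │
└──────────────────┬───────────────────┘
                   │ score + feedback
                   ↓
        ┌─────────────────────┐
        │ Score ≥ threshold?  │
        │ Yes → done          │
        │ No → Generator      │
        └──────────┬──────────┘
                   └── loop (5-15 rounds)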
The structural fix: Evaluator never reads the generator's code, reasoning, or commit messages. It only reads the SPEC and operates the deployed artifact. This eliminates anchoring bias architecturally — not through clever prompting.
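One way to make this isolation hard to break by accident is to give the Evaluator a payload type that has no place to put generator context. The sketch below is illustrative only: `EvaluatorInput`, `build_evaluator_input`, `assert_isolated`, and the key names in `FORBIDDEN_KEYS` are hypothetical and do not come from the project's scripts.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvaluatorInput:
    """Everything the Evaluator is allowed to see (hypothetical structure)."""
    spec: str          # full text of the Spec Contract
    deployed_url: str  # where the running artifact can be exercised


# Generator-side material that must never reach the Evaluator (assumed names).
FORBIDDEN_KEYS = {"source_code", "reasoning", "commit_messages"}


def build_evaluator_input(generator_output: dict, spec: str) -> EvaluatorInput:
    """Copy only the deployed URL out of the generator's output."""
    return EvaluatorInput(spec=spec, deployed_url=generator_output["deployed_url"])


def assert_isolated(payload: EvaluatorInput) -> None:
    """Fail loudly if a forbidden field is ever added to the payload type."""
    leaked = FORBIDDEN_KEYS & {f.name for f in fields(payload)}
    if leaked:
        raise ValueError(f"Evaluator payload leaks generator context: {sorted(leaked)}")
```

Because the payload is built by copying a single field rather than by filtering a larger dict, anything new the generator starts emitting stays out of the Evaluator's view by default.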
When to Use VibeCoding Pro
| Scenario | Apply? | Why |
|---|---|---|
| React / H5 / Web UI with real interactions | ✅ Yes | Playwright can actually click through it |
| Multi-step form flows (wizard, checkout, onboarding) | ✅ Yes | Evaluator can exercise each step |
| API + frontend integration | ✅ Yes | Evaluator calls endpoints and checks DB state |
| Single utility function | ⚠️ Optional | Might be overkill |
| Pure backend logic (no UI) | ⚠️ Use API Evaluator template | Evaluator calls endpoints directly |
| Design-sensitive work (brand identity, layout) | ✅ Yes | Human-in-the-loop variant works best |
Quick Start
Step 1: Write a Spec Contract
The SPEC is the most important artifact. It's the Evaluator's only reference.
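SPEC: [Feature Name] v1.0
Goal
[One sentence: what should exist when this is done?]
Functional Requirements
- FR-001: [Specific, testable, observable]
- FR-002: [...]
Interaction Spec
- UI-001: [User clicks X → Y happens]
- UI-002: [Form accepts type Y, rejects type N]
Acceptance Criteria
- AC-001: [Measurable outcome]
- AC-002: [...]
Out of Scope
Test Scenarios
Scenario 1: Happy path (a typical user completes the main action)
Scenario 2: Edge cases (empty data, error states)
Scenario 3: Extreme cases (maximum input length, concurrent operations)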
Step 2: Run the Loop
1. Generator Agent receives: SPEC + iteration history + previous Evaluator feedback
2. Generator builds artifact and deploys
3. Evaluator Agent receives: SPEC + deployed URL (NOT generator code)
4. Evaluator opens browser, clicks through test scenarios, screenshots, scores
5. Evaluator returns structured JSON with score breakdown
6. If score ≥ threshold → done. If not → loop back to Generator.
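The steps above can be sketched as a small driver function. This is a minimal illustration, not the contents of `scripts/iteration_loop.py`: `run_loop`, the `generate` and `evaluate` callables, and the `score` / `feedback` report keys are assumed names chosen for the example.

```python
from typing import Callable


def run_loop(
    spec: str,
    generate: Callable[[str, list], str],   # (spec, history) -> deployed URL
    evaluate: Callable[[str, str], dict],   # (spec, url) -> {"score": ..., "feedback": ...}
    pass_threshold: int = 85,
    max_rounds: int = 15,
) -> dict:
    """Alternate Generator and Evaluator until the score passes or rounds run out."""
    history: list = []
    for round_no in range(1, max_rounds + 1):
        # Generator sees the spec plus all earlier scores and feedback.
        url = generate(spec, history)
        # Evaluator sees only the spec and the deployed URL.
        report = evaluate(spec, url)
        history.append(
            {"round": round_no, "score": report["score"], "feedback": report["feedback"]}
        )
        if report["score"] >= pass_threshold:
            return {"passed": True, "rounds": round_no, "history": history}
    return {"passed": False, "rounds": max_rounds, "history": history}
```

Passing the two agents in as plain callables keeps the loop independent of any particular model API or QA stack, so stubs can stand in for both during testing.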
Architecture Reference
See references/architecture.md for:
- Four architecture variants (Sequential / Parallel / Staged / Human-in-loop)
- GAN theory deep-dive and why it works
- Spec Contract template (copy-paste ready)
- History format and loop control logic
- Anti-patterns and how to fix them
Evaluator Templates
See references/evaluator-prompts.md for:
| Template | When to Use | Evaluator Mode |
|---|---|---|
| Web/H5 UI | React/Vue/H5/Web components | Playwright browser automation |
| API/Backend | REST endpoints, microservices | Direct HTTP calls |
| Content/Docs | Reports, copy, documentation | Structured text scoring |
Each template includes:
- System prompt (calibrated for evaluator independence)
- User prompt with rubric
- Required JSON output schema
- 4 calibration examples (30/60/85/95 score ranges)
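To show why a required JSON schema matters, here is a hedged sketch of a validator that rejects incomplete Evaluator reports before they reach the loop. The actual schema lives in references/evaluator-prompts.md and is not reproduced in this document, so the `scores` and `issues` keys and the dimension names (modeled on the default rubric in the Scoring Rubric section) are assumptions.

```python
import json

# Assumed dimension keys, mirroring the default rubric's five dimensions.
REQUIRED_DIMENSIONS = (
    "functional_completeness",
    "interaction_quality",
    "edge_case_handling",
    "code_design_quality",
    "originality_craft",
)


def parse_evaluator_report(raw: str) -> dict:
    """Parse an Evaluator reply and reject anything that is not fully structured."""
    report = json.loads(raw)
    scores = report.get("scores")
    if not isinstance(scores, dict):
        raise ValueError("report must contain a 'scores' object")
    missing = [d for d in REQUIRED_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    for name in REQUIRED_DIMENSIONS:
        value = scores[name]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 100:
            raise ValueError(f"{name} must be a number in 0-100, got {value!r}")
    if not isinstance(report.get("issues"), list):
        raise ValueError("report must contain an 'issues' list")
    return report
```

Rejecting a malformed report outright, rather than guessing at missing dimensions, keeps vague verdicts from being counted as scores.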
Iteration Loop Scripts
See scripts/iteration_loop.py for a complete Python implementation:
- `run_generator()`: adapt to your agent (Claude API, OpenAI, subagent, etc.)
- `run_evaluator()`: adapt to your QA stack (Playwright, HTTP client, etc.)
- Full loop control: plateau detection, approach switching, escalation
- CLI: `python iteration_loop.py --spec spec.md --url http://localhost:3000 --threshold 85 --rounds 15`
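The loop-control ideas in this list (plateau detection, approach switching, escalation) could look roughly like the sketch below. It is not taken from `scripts/iteration_loop.py`; `is_plateau`, `next_action`, the 3-round window, and the 2-point minimum gain are assumptions made for illustration.

```python
def is_plateau(scores: list, window: int = 3, min_gain: float = 2.0) -> bool:
    """True when the last `window` rounds improved the best score by less than `min_gain`."""
    if len(scores) <= window:
        return False  # not enough history to compare against
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent - best_before < min_gain


def next_action(scores: list, pass_threshold: int = 85, max_rounds: int = 15) -> str:
    """Decide what the loop should do after the latest round."""
    if scores and scores[-1] >= pass_threshold:
        return "done"
    if len(scores) >= max_rounds:
        return "escalate"          # hand off to a human or a stronger setup
    if is_plateau(scores):
        return "switch_approach"   # ask the Generator for a different strategy
    return "continue"
```

Comparing the best recent score with the best earlier score, rather than only adjacent rounds, avoids treating a single noisy dip as a plateau.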
See scripts/calibrate_evaluator.py for evaluator calibration utility:
- Run on 4 known examples before production
- Auto-detects score drift and suggests rubric adjustments
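As a rough illustration of what drift detection can mean here, the sketch below compares the Evaluator's scores on known examples against their expected scores. It is not the logic of `scripts/calibrate_evaluator.py`; `calibration_drift`, the example labels, and the 5-point tolerance are hypothetical.

```python
from statistics import mean


def calibration_drift(expected: dict, observed: dict, tolerance: float = 5.0) -> dict:
    """Compare observed Evaluator scores with known-good scores for the same examples."""
    deltas = {name: observed[name] - expected[name] for name in expected}
    bias = mean(list(deltas.values()))  # positive means the Evaluator scores too high
    outliers = sorted(name for name, d in deltas.items() if abs(d) > tolerance)
    if bias > tolerance:
        verdict = "inflated"
    elif bias < -tolerance:
        verdict = "harsh"
    elif outliers:
        verdict = "inconsistent"
    else:
        verdict = "calibrated"
    return {"bias": bias, "outliers": outliers, "verdict": verdict}


# Four known examples, one per calibration score range (labels are made up).
expected = {"broken": 30, "partial": 60, "good": 85, "excellent": 95}
observed = {"broken": 45, "partial": 70, "good": 90, "excellent": 97}
result = calibration_drift(expected, observed)
```

In this made-up run the Evaluator scores every example too high, by 8 points on average, so the verdict is "inflated" and the rubric wording for weak examples would be the first thing to tighten.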
Scoring Rubric
Default rubric (adjust weights by domain):
| Dimension | Weight | Measures |
|---|---|---|
| Functional completeness | 30% | Every spec requirement works end-to-end |
| Interaction quality | 25% | Click/form/nav behavior as a real user |
| Edge case handling | 20% | Error states, empty data, boundary inputs |
| Code/design quality | 15% | Consistency, readability, no anti-patterns |
| Originality/craft | 10% | Avoids template defaults and AI slop patterns |
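The weights above combine into a single 0-100 score as a weighted sum. The sketch below shows the arithmetic under the default weights; the dictionary keys and the `weighted_score` helper are illustrative names, not part of the project's scripts.

```python
# Default rubric weights from the table above (key names are assumed).
WEIGHTS = {
    "functional_completeness": 0.30,
    "interaction_quality": 0.25,
    "edge_case_handling": 0.20,
    "code_design_quality": 0.15,
    "originality_craft": 0.10,
}


def weighted_score(scores: dict, weights: dict = WEIGHTS) -> float:
    """Combine per-dimension scores (0-100) into one weighted total."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return round(sum(scores[dim] * w for dim, w in weights.items()), 2)


example = {
    "functional_completeness": 90,
    "interaction_quality": 80,
    "edge_case_handling": 70,
    "code_design_quality": 85,
    "originality_craft": 60,
}
total = weighted_score(example)  # 27 + 20 + 14 + 12.75 + 6 = 79.75
```

With these example scores the total is 79.75, so strong functional completeness alone does not offset weak edge-case handling and craft.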
Threshold guidelines:
| Use Case | PASS_THRESHOLD | MAX_ROUNDS |
|---|---|---|
| Internal prototype | 70 | 10 |
| User-facing feature | 85 | 15 |
| Production critical | 95 | 20 + human review |
Why This Works (Research Background)
Source: Anthropic Engineering, "Harness Design for Long-Running Application Development" (March 2026)
Key findings:
- Solo Claude agents on 16-feature game maker: core game loop broken, entity runtime wiring severed
- Full harness (Generator + Evaluator): fully working, sprite animation, sound, AI-assisted level design
- Opus 4.6 vs 4.5: improved planning reduced harness complexity needed
- Evaluator value is situational: worth the cost when task exceeds what the model reliably does solo
GAN theory parallel: The Generator tries to fool the Evaluator. The Evaluator tries to catch failures the Generator misses. The adversarial tension drives quality upward. Unlike ML GANs, this uses natural language feedback — it's fully inspectable and steerable.
Common Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| Same agent generates and evaluates | Cognitive anchoring bias | Separate agents with separate prompts |
| Evaluator reads generator's code | Judges intent, not reality | Show only deployed URL |
| Skipping calibration | Score inflation/drift | Run 3-5 known examples first |
| Vague scoring ("7/10 looks fine") | Unactionable feedback | Require structured JSON per rubric |
| Too few rounds | Generator never converges | Minimum 10 rounds for complex UI |
| Never switching approach | Gets stuck in local minimum | Switch strategy after 3 plateauing rounds |
| Using for trivial tasks | Overhead > value | Reserve for multi-feature/full-page work |
OpenClaw Integration
In OpenClaw, use the coder + tester subagents:
Generator → sessions_spawn(agentId=coder, ...)
Evaluator → sessions_spawn(agentId=tester, ...)
The tester subagent should use the Playwright MCP tool:
- `browser_navigate` → open URL
- INLINECODE11 → interact
- INLINECODE12 → form input
- INLINECODE13 → capture evidence
Built on Anthropic's 2026 engineering research. Inspired by GAN theory and adversarial validation patterns.
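VibeCoding Pro
The AI coding upgrade that actually ships working software.
VibeCoding is fun. VibeCoding Pro is reliable.
What VibeCoding Gets Wrong
Most AI coding workflows look like this:
You → "Build a login form" → AI generates → "Looks good!" → Ship
                                                  ↑
                                          That's the problem.
Why it's broken: The same AI that generated the code judges whether it works. It suffers from cognitive commitment bias: it can't objectively evaluate what it just built because it has already committed to the approach. Bugs survive. Edge cases break. UX issues ship.
The evidence: Anthropic's 2026 engineering research describes an experiment in which solo Claude agents produced 2D game makers whose core game loop was fundamentally broken: entities rendered but ignored all player input. The agent still called its own output "working." Only when a separate Evaluator agent clicked through the running game did it find that the wiring between entity definitions and the game runtime was severed.
What VibeCoding Pro Gets Right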
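User Goal / SPEC
       ↓
┌─────────────┐
│  Generator  │ ← builds X from SPEC
│   (vibe)    │
└──────┬──────┘
       │ artifact
       ↓
┌──────────────────────────────────────┐
│              Evaluator               │
│ • Reads SPEC (NOT generator output)  │
│ • Opens URL in a real browser        │
│ • Clicks, fills, navigates           │
│ • Scores against rubric (0-100)      │
│ • Returns structured JSON feedback   │
└──────────────────┬───────────────────┘
                   │ score + feedback
                   ↓
        ┌─────────────────────┐
        │ Score ≥ threshold?  │
        │ Yes → done          │
        │ No → Generator      │
        └──────────┬──────────┘
                   └── loop (5-15 rounds)
The structural fix: Evaluator never reads the generator's code, reasoning, or commit messages. It only reads the SPEC and operates the deployed artifact. This eliminates anchoring bias architecturally, not through clever prompting.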
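When to Use VibeCoding Pro
| Scenario | Apply? | Why |
|---|---|---|
| React / H5 / Web UI with real interactions | ✅ Yes | Playwright can actually click through it |
| Multi-step form flows (wizard, checkout, onboarding) | ✅ Yes | Evaluator can exercise each step |
| API + frontend integration | ✅ Yes | Evaluator calls endpoints and checks DB state |
| Single utility function | ⚠️ Optional | Might be overkill |
| Pure backend logic (no UI) | ⚠️ Use API Evaluator template | Evaluator calls endpoints directly |
| Design-sensitive work (brand identity, layout) | ✅ Yes | Human-in-the-loop variant works best |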
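Quick Start
Step 1: Write a Spec Contract
The SPEC is the most important artifact. It's the Evaluator's only reference.
SPEC: [Feature Name] v1.0
Goal
[One sentence: what should exist when this is done?]
Functional Requirements
- FR-001: [Specific, testable, observable]
- FR-002: [...]
Interaction Spec
- UI-001: [User clicks X → Y happens]
- UI-002: [Form accepts type Y, rejects type N]
Acceptance Criteria
- AC-001: [Measurable outcome]
- AC-002: [...]
Out of Scope
Test Scenarios
Scenario 1: Happy path (a typical user completes the main action)
Scenario 2: Edge cases (empty data, error states)
Scenario 3: Extreme cases (maximum input length, concurrent operations)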
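Step 2: Run the Loop
1. Generator Agent receives: SPEC + iteration history + previous Evaluator feedback
2. Generator builds artifact and deploys
3. Evaluator Agent receives: SPEC + deployed URL (NOT generator code)
4. Evaluator opens browser, clicks through test scenarios, screenshots, scores
5. Evaluator returns structured JSON with score breakdown
6. If score ≥ threshold → done. If not → loop back to Generator.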
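Architecture Reference
See references/architecture.md for:
- Four architecture variants (Sequential / Parallel / Staged / Human-in-loop)
- GAN theory deep-dive and why it works
- Spec Contract template (copy-paste ready)
- History format and loop control logic
- Anti-patterns and how to fix them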
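Evaluator Templates
See references/evaluator-prompts.md for:
| Template | When to Use | Evaluator Mode |
|---|---|---|
| Web/H5 UI | React/Vue/H5/Web components | Playwright browser automation |
| API/Backend | REST endpoints, microservices | Direct HTTP calls |
| Content/Docs | Reports, copy, documentation | Structured text scoring |
Each template includes:
- System prompt (calibrated for evaluator independence)
- User prompt with rubric
- Required JSON output schema
- 4 calibration examples (30/60/85/95 score ranges)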
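Iteration Loop Scripts
See scripts/iteration_loop.py for a complete Python implementation:
- `run_generator()`: adapt to your agent (Claude API, OpenAI, subagent, etc.)
- `run_evaluator()`: adapt to your QA stack (Playwright, HTTP client, etc.)
- Full loop control: plateau detection, approach switching, escalation
- CLI: `python iteration_loop.py --spec spec.md --url http://localhost:3000 --threshold 85 --rounds 15`
See scripts/calibrate_evaluator.py for evaluator calibration utility:
- Run on 4 known examples before production
- Auto-detects score drift and suggests rubric adjustments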
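Scoring Rubric
Default rubric (adjust weights by domain):
| Dimension | Weight | Measures |
|---|---|---|
| Functional completeness | 30% | Every spec requirement works end-to-end |
| Interaction quality | 25% | Click/form/nav behavior as a real user |
| Edge case handling | 20% | Error states, empty data, boundary inputs |
| Code/design quality | 15% | Consistency, readability, no anti-patterns |
| Originality/craft | 10% | Avoids template defaults and AI slop patterns |
Threshold guidelines:
| Use Case | Pass Threshold | Max Rounds |
|---|---|---|
| Internal prototype | 70 | 10 |
| User-facing feature | 85 | 15 |
| Production critical | 95 | 20 + human review |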
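Why This Works (Research Background)
Source: Anthropic Engineering, "Harness Design for Long-Running Application Development" (March 2026)
Key findings:
- Solo Claude agents on 16-feature game maker: core game loop broken, entity runtime wiring severed
- Full harness (Generator + Evaluator): fully working, sprite animation, sound, AI-assisted level design
- Opus 4.6 vs 4.5: improved planning reduced harness complexity needed
- Evaluator value is situational: worth the cost when task exceeds what the model reliably does solo
GAN theory parallel: The Generator tries to fool the Evaluator. The Evaluator tries to catch failures the Generator misses. The adversarial tension drives quality upward. Unlike ML GANs, this uses natural language feedback, so it is fully inspectable and steerable.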
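Common Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| Same agent generates and evaluates | Cognitive anchoring bias | Separate agents with separate prompts |
| Evaluator reads generator's code | Judges intent, not reality | Show only deployed URL |
| Skipping calibration | Score inflation/drift | Run 3-5 known examples first |
| Vague scoring ("7/10 looks fine") | Unactionable feedback | Require structured JSON per rubric |
| Too few rounds | Generator never converges | Minimum 10 rounds for complex UI |
| Never switching approach | Gets stuck in local minimum | Switch strategy after 3 plateauing rounds |
| Using for trivial tasks | Overhead > value | Reserve for multi-feature/full-page work |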
OpenClaw Integration
In OpenClaw, use the coder + tester subagents:
Generator → sessions_spawn(agentId=coder, ...)
Evaluator → sessions_spawn(agentId=tester, ...