LLM-as-Judge
Core principle: Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.
Activation Criteria
Use this pattern when:
- Architecture or system design decisions
- Multi-file changes affecting >5 files or >500 LOC
- Security-critical code (auth, payments, crypto/DeFi)
- Financial/trading systems (market making, quant strategies)
- Planning documents that will drive weeks of work
- Stuck after 3+ failed attempts on same problem
Skip when:
- Simple edits, config tweaks, bug fixes with obvious cause
- Documentation updates
- Single-file changes under 100 LOC
- Tasks where self-review is sufficient
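The criteria above can be sketched as a single predicate. This is a minimal illustration, not part of the skill itself: the `Task` fields and the `needs_judge` name are hypothetical, and the rule that security and financial triggers override the skip list is an assumption based on the "mandatory" wording in the Quick Decision section.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a unit of work (field names are assumptions)."""
    files_changed: int = 1
    lines_changed: int = 0
    is_architecture_decision: bool = False
    is_security_critical: bool = False   # auth, payments, crypto/DeFi
    is_financial_system: bool = False    # market making, quant strategies
    is_long_horizon_plan: bool = False   # planning doc that drives weeks of work
    failed_attempts: int = 0
    is_docs_only: bool = False


def needs_judge(task: Task) -> bool:
    """Return True when the activation criteria call for cross-model review."""
    # Assumption: security and financial work always gets a judge.
    if task.is_security_critical or task.is_financial_system:
        return True
    if task.is_architecture_decision or task.is_long_horizon_plan:
        return True
    # Stuck after 3+ failed attempts on the same problem.
    if task.failed_attempts >= 3:
        return True
    # Documentation updates are on the skip list.
    if task.is_docs_only:
        return False
    # Large changes: more than 5 files or more than 500 LOC.
    return task.files_changed > 5 or task.lines_changed > 500
```

A small single-file edit such as `Task(files_changed=1, lines_changed=50)` falls through every trigger and returns `False`, which matches the skip list.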
The Pattern
Executor (Model A) → Output → Judge (Model B) → Verdict → Action
Verdicts: APPROVE | REVISE (with specific feedback) | REJECT (restart)
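One way to wire the three verdicts into a loop is sketched below. The `executor` and `judge` callables are placeholders for whatever model clients are in use, and the `max_rounds` cap and feedback wording are assumptions rather than part of the pattern.

```python
from enum import Enum
from typing import Callable, NamedTuple, Tuple


class Verdict(Enum):
    APPROVE = "APPROVE"
    REVISE = "REVISE"
    REJECT = "REJECT"


class Review(NamedTuple):
    verdict: Verdict
    feedback: str


def run_with_judge(
    task: str,
    executor: Callable[[str, str], str],   # (task, feedback) -> output
    judge: Callable[[str, str], Review],   # (task, output) -> review
    max_rounds: int = 3,                   # assumed cap to avoid endless loops
) -> Tuple[str, Verdict]:
    """Run executor and judge in turn until approval or the round cap."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    for _ in range(max_rounds):
        output = executor(task, feedback)
        review = judge(task, output)
        if review.verdict is Verdict.APPROVE:
            break
        if review.verdict is Verdict.REJECT:
            # REJECT means restart: tell the executor to discard its draft.
            feedback = "Previous attempt was rejected; start over. Reason: " + review.feedback
        else:
            # REVISE means keep the draft and address the specific findings.
            feedback = "Revise the previous attempt. Findings: " + review.feedback
    return output, review.verdict
```

The function returns the last output together with the last verdict, so a caller can tell an approved result from one that merely ran out of rounds.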
Model Pairing
Use a different provider than the executor to avoid shared blind spots:
- Executor: Claude → Judge: kimi or grok or gemini-pro
- Executor: Kimi/Gemini → Judge: opus
- Principle: Different provider, similar capability tier
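The pairing rule can be enforced with a small lookup. This is a sketch under assumptions: the alias strings mirror the list above, while the `PROVIDER` table, the `pick_judge` name, and the choice to take the first eligible candidate are illustrative only. It does not check capability tier.

```python
# Provider behind each model alias used in the pairing list.
PROVIDER = {
    "claude": "anthropic",
    "opus": "anthropic",
    "kimi": "moonshot",
    "grok": "xai",
    "gemini": "google",
    "gemini-pro": "google",
}

# Candidate judges per executor, in the order given in the pairing list.
JUDGES_FOR = {
    "claude": ["kimi", "grok", "gemini-pro"],
    "kimi": ["opus"],
    "gemini": ["opus"],
}


def pick_judge(executor: str) -> str:
    """Return the first candidate judge whose provider differs from the executor's."""
    name = executor.lower()
    if name not in JUDGES_FOR:
        raise KeyError(f"no pairing defined for executor {executor!r}")
    for candidate in JUDGES_FOR[name]:
        if PROVIDER[candidate] != PROVIDER[name]:
            return candidate
    raise ValueError(f"no cross-provider judge available for {executor!r}")
```

Checking the provider at lookup time means an accidental same-provider entry in `JUDGES_FOR` is skipped rather than silently used.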
Judge Prompt Templates
Plan/Architecture Review
See references/judge-prompts.md for full templates covering:
- Plan completeness, feasibility, risk, testing strategy
- Architecture review with scoring (0-10 per dimension)
- Code review checklist (correctness, design, safety, maintainability)
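The referenced templates are not reproduced here. As a rough idea of the shape such a prompt and its reply could take, the sketch below builds a code-review prompt over the four checklist dimensions and validates a JSON reply with 0-10 scores. The prompt wording, the JSON schema, and both function names are hypothetical and are not the contents of references/judge-prompts.md.

```python
import json

# Dimensions taken from the code review checklist above.
CODE_REVIEW_DIMENSIONS = ("correctness", "design", "safety", "maintainability")
VERDICTS = {"APPROVE", "REVISE", "REJECT"}


def build_judge_prompt(artifact: str, dimensions=CODE_REVIEW_DIMENSIONS) -> str:
    """Build a hypothetical judge prompt that asks for scores and a verdict."""
    dims = ", ".join(dimensions)
    return (
        "You are reviewing work produced by another model. Do not rewrite it.\n"
        f"Score each dimension from 0 to 10: {dims}.\n"
        'Reply with JSON only: {"verdict": "APPROVE|REVISE|REJECT", '
        '"scores": {"<dimension>": <int>}, "items": ["<specific finding>"]}\n\n'
        f"--- WORK UNDER REVIEW ---\n{artifact}"
    )


def parse_judge_reply(raw: str, dimensions=CODE_REVIEW_DIMENSIONS) -> dict:
    """Parse the judge's JSON reply and reject incomplete or vague output."""
    reply = json.loads(raw)
    if reply.get("verdict") not in VERDICTS:
        raise ValueError("verdict must be APPROVE, REVISE, or REJECT")
    scores = reply.get("scores", {})
    for dim in dimensions:
        score = scores.get(dim)
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 0 <= score <= 10:
            raise ValueError(f"missing or out-of-range score for {dim!r}")
    if not reply.get("items"):
        # A reply with no findings cannot be acted on.
        raise ValueError("judge returned no specific items")
    return reply
```

Validating the reply before acting on it turns a bare "looks good" into an explicit error that can trigger a re-prompt.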
Integration Points
- With adversarial review: This IS the formalized version of "spawn a separate model to review"
- With planning-protocol: Judge reviews the plan before the Execute phase
- With coding workflows: Code → cross-model review → fix findings → test → build → push
Quick Decision
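Simple task? → Self-review
Complex/high-stakes task? → LLM-as-judge
Stuck after multiple retries? → LLM-as-judge (fresh perspective)
Financial/security-related? → LLM-as-judge (mandatory)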
Gotchas
- Same provider defeats the purpose: Claude Opus judging Claude Sonnet shares largely the same training distribution, so it tends to miss the same things. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
- Vague judge output is useless: if the judge says "looks good" without specifics, the prompt is too weak. Always require scored dimensions plus specific, actionable items, even when approving.
- Judge scope creep: judges sometimes rewrite the entire plan instead of reviewing it. Constrain the output to an APPROVE / REVISE / REJECT verdict with specific feedback, not a replacement solution.
- Approval rate drift: if the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target a 60-70% approval rate.
- Don't judge trivial tasks: a 50-line CSS fix doesn't need cross-model review. Apply the activation criteria in this skill strictly.
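Approval rate drift is easy to miss without a counter. The sketch below tracks recent verdicts and flags when the approval rate exceeds the 80% ceiling mentioned above. The class name, the 50-verdict window, and the 20-sample minimum are assumptions chosen for illustration.

```python
from collections import deque


class ApprovalRateMonitor:
    """Track recent judge verdicts and flag an approval rate that is too high."""

    def __init__(self, window: int = 50, ceiling: float = 0.80):
        # Only the most recent `window` verdicts are kept (assumed window size).
        self.verdicts = deque(maxlen=window)
        self.ceiling = ceiling

    def record(self, verdict: str) -> None:
        """Store True for an approval, False for REVISE or REJECT."""
        self.verdicts.append(verdict == "APPROVE")

    @property
    def approval_rate(self) -> float:
        if not self.verdicts:
            return 0.0
        return sum(self.verdicts) / len(self.verdicts)

    def is_drifting(self, min_samples: int = 20) -> bool:
        """True when enough verdicts are recorded and approvals exceed the ceiling."""
        return len(self.verdicts) >= min_samples and self.approval_rate > self.ceiling
```

When `is_drifting()` returns `True`, the gotcha above suggests two things to check: whether the judge is too similar to the executor, and whether the judge prompt is too lenient.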