LLM-as-Judge

Core principle: Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.

Activation Criteria

Use this pattern when:

- Architecture or system design decisions
- Multi-file changes affecting >5 files or >500 LOC
- Security-critical code (auth, payments, crypto/DeFi)
- Financial/trading systems (market making, quant strategies)
- Planning documents that will drive weeks of work
- Stuck after 3+ failed attempts on the same problem

Skip when:

- Simple edits, config tweaks, bug fixes with an obvious cause
- Documentation updates
- Single-file changes under 100 LOC
- Tasks where self-review is sufficient
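
The two lists above can be read as a single decision function. What follows is a minimal sketch, not part of the skill itself: the `Task` fields and the kind labels are hypothetical names chosen for illustration, and the precedence (high-stakes kinds and repeated failures win over the skip list) is an assumption.

```python
from dataclasses import dataclass

# Kinds from the "use" list that warrant a judge regardless of size.
HIGH_STAKES_KINDS = {"architecture", "security", "financial", "planning"}
# Kinds from the "skip" list. Labels are illustrative, not defined by the skill.
SKIP_KINDS = {"simple_edit", "config", "obvious_bugfix", "docs"}


@dataclass
class Task:
    """Hypothetical task descriptor used only for this sketch."""
    kind: str
    files_changed: int = 1
    loc_changed: int = 0
    failed_attempts: int = 0


def needs_judge(task: Task) -> bool:
    """Return True when the activation criteria call for a cross-model judge."""
    if task.kind in HIGH_STAKES_KINDS:
        return True
    if task.failed_attempts >= 3:
        # Stuck on the same problem: a fresh perspective is worth the cost.
        return True
    if task.kind in SKIP_KINDS:
        return False
    # Size thresholds from the list above: >5 files or >500 LOC.
    return task.files_changed > 5 or task.loc_changed > 500
```

For example, `needs_judge(Task(kind="feature", files_changed=6))` returns `True`, while a one-file, 80-line feature change returns `False` and stays with self-review.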

The Pattern

Executor (Model A) → Output → Judge (Model B) → Verdict → Action

Verdicts: APPROVE | REVISE (with specific feedback) | REJECT (restart)
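
One way to wire the three verdicts into a loop is sketched below. It assumes the executor and judge are plain callables wrapping calls to two different providers; the names `run_with_judge`, `Review`, and `max_rounds` are hypothetical, and the handling of REJECT (discard earlier feedback, keep only the rejection reasons) is one interpretation of "restart".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    APPROVE = "APPROVE"
    REVISE = "REVISE"
    REJECT = "REJECT"


@dataclass
class Review:
    verdict: Verdict
    feedback: List[str] = field(default_factory=list)


# Hypothetical signatures: each callable would wrap a different provider's API.
Executor = Callable[[str, List[str]], str]  # (task, feedback so far) -> output
Judge = Callable[[str, str], Review]        # (task, output) -> review


def run_with_judge(task: str, executor: Executor, judge: Judge,
                   max_rounds: int = 3) -> str:
    """Run executor -> judge rounds until the judge approves."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        output = executor(task, feedback)
        review = judge(task, output)
        if review.verdict is Verdict.APPROVE:
            return output
        if review.verdict is Verdict.REJECT:
            # Restart: drop accumulated feedback, keep only the rejection reasons.
            feedback = list(review.feedback)
        else:
            # REVISE: carry the judge's specific feedback into the next round.
            feedback = feedback + review.feedback
    raise RuntimeError(f"no approval after {max_rounds} rounds")
```

Capping the number of rounds keeps a disagreeing pair from looping forever; what to do when the cap is hit (escalate to a human, switch judge) is left open here.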

Model Pairing

Use a different provider than the executor to avoid shared blind spots:

- Executor: Claude → Judge: kimi, grok, or gemini-pro
- Executor: Kimi/Gemini → Judge: opus
- Principle: Different provider, similar capability tier
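
The pairing rule can be enforced mechanically rather than by convention. The sketch below assumes a small lookup table of model aliases; the aliases and the `pick_judge` helper are illustrative, and capability tier is not modeled at all.

```python
# Provider behind each model alias used in this sketch (aliases are illustrative).
PROVIDER_OF = {
    "claude": "anthropic",
    "opus": "anthropic",
    "kimi": "moonshot",
    "grok": "xai",
    "gemini": "google",
    "gemini-pro": "google",
}

# Candidate judges per executor, in preference order (an assumed ordering).
JUDGE_CANDIDATES = {
    "claude": ["kimi", "grok", "gemini-pro"],
    "kimi": ["opus"],
    "gemini": ["opus"],
}


def same_provider(model_a: str, model_b: str) -> bool:
    """True when both aliases map to the same provider."""
    return PROVIDER_OF[model_a.lower()] == PROVIDER_OF[model_b.lower()]


def pick_judge(executor: str) -> str:
    """Return the first configured judge that comes from a different provider."""
    key = executor.lower()
    for candidate in JUDGE_CANDIDATES.get(key, []):
        if not same_provider(key, candidate):
            return candidate
    raise ValueError(f"no cross-provider judge configured for {executor!r}")
```

Failing loudly when no cross-provider judge is configured is deliberate: silently falling back to a same-provider judge would reintroduce the shared blind spots the pattern exists to avoid.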

Judge Prompt Templates

Plan/Architecture Review

See references/judge-prompts.md for full templates covering:

- Plan completeness, feasibility, risk, testing strategy
- Architecture review with scoring (0-10 per dimension)
- Code review checklist (correctness, design, safety, maintainability)

Integration Points

- With adversarial review: This IS the formalized version of "spawn a separate model to review"
- With planning-protocol: Judge reviews the plan before the Execute phase
- With coding workflows: Code → cross-model review → fix findings → test → build → push

Quick Decision

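Simple task? → Self-review
Complex/high-stakes task? → LLM-as-Judge
Stuck after multiple retries? → LLM-as-Judge (fresh perspective)
Financial/security-related? → LLM-as-Judge (mandatory)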

Gotchas

- Same provider defeats the purpose: when Claude Opus judges Claude Sonnet, both models share the same training distribution. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
- Vague judge output is useless: if the judge says "looks good" without specifics, the prompt is too weak. Always require scored dimensions plus specific, actionable items, even when the verdict is APPROVE.
- Judge scope creep: judges sometimes rewrite the entire plan instead of reviewing it. Constrain the output to APPROVE / REVISE / REJECT with specific feedback, not a replacement solution.
- Approval rate drift: if the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target a 60-70% approval rate.
- Don't judge trivial tasks: a 50-line CSS fix doesn't need cross-model review. Apply the activation criteria above strictly.
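
Two of these gotchas lend themselves to automated checks: rejecting vague judge output and watching the approval rate. The sketch below is one possible shape, with hypothetical names throughout (`is_specific`, `ApprovalTracker`, the status labels) and an assumed rolling window of 50 reviews; only the 0-10 score range, the 80% ceiling, and the 60-70% target come from the text above.

```python
from collections import deque
from typing import Deque, Dict, List


def is_specific(scores: Dict[str, int], action_items: List[str]) -> bool:
    """A review is usable only with scored dimensions and at least one concrete item."""
    has_scores = bool(scores) and all(0 <= s <= 10 for s in scores.values())
    has_items = any(item.strip() for item in action_items)
    return has_scores and has_items


class ApprovalTracker:
    """Rolling approval rate over the most recent reviews."""

    def __init__(self, window: int = 50) -> None:
        self._recent: Deque[bool] = deque(maxlen=window)

    def record(self, approved: bool) -> None:
        self._recent.append(approved)

    def rate(self) -> float:
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def status(self) -> str:
        rate = self.rate()
        if rate > 0.80:
            return "too_lenient"   # pairing too similar or prompts too soft
        if 0.60 <= rate <= 0.70:
            return "on_target"
        return "outside_target"
```

A review that fails `is_specific` can be sent back to the judge with a stricter prompt instead of being counted as a verdict, which also keeps empty approvals from inflating the tracked rate.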
