Multi-Model Critique

Overview

Use this skill only for complex tasks. Route multiple models through the same 4-step loop (Plan -> Execute -> Review -> Improve), then run cross-critique and synthesis to produce a higher-quality final answer than any single-model draft.

Trigger rule

Enable this skill only when the request explicitly sets complex to true (or equivalent wording such as “this is complex/deep”).

If complex is false, skip this skill and respond with normal single-model behavior.
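The trigger rule can be sketched as a small gate function. This is a minimal illustration, not part of the skill's shipped scripts: the function name should_enable and the phrase pattern are assumptions, and the pattern only covers the example wording quoted above.

```python
import re
from typing import Optional

# Hypothetical phrase pattern: the text only gives "this is complex/deep"
# as an example of equivalent wording, so nothing broader is matched here.
COMPLEX_PHRASES = re.compile(r"\bthis is (complex|deep)\b", re.IGNORECASE)


def should_enable(request_text: str, complex_flag: Optional[bool] = None) -> bool:
    """Return True only when the request explicitly marks itself as complex."""
    # An explicit flag always wins over wording in the request text.
    if complex_flag is not None:
        return complex_flag is True
    return bool(COMPLEX_PHRASES.search(request_text))
```

An explicit complex value of false disables the skill even if the request text contains the phrase, which keeps the flag authoritative.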

Inputs

Collect or confirm these inputs before execution:

- complex: boolean flag (must be true)
- question: user request
- models: list of ACP agentId values (typically 3)
- constraints: output format, language, length, deadlines, forbidden assumptions
- ops: optional runtime controls (timeoutSec, maxRetries, maxRounds, budgetUsd)
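One way to collect and confirm these inputs is a small validated container. The sketch below is an assumption-laden illustration: the class name CritiqueInputs, the field names question, models, constraints, and ops, and the "at least two models" check are not stated by the skill and are chosen here only because cross-critique needs a peer.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four runtime controls named in the inputs list.
ALLOWED_OPS = {"timeoutSec", "maxRetries", "maxRounds", "budgetUsd"}


@dataclass
class CritiqueInputs:
    complex: bool
    question: str
    models: List[str]  # ACP agentId values
    constraints: Dict[str, str] = field(default_factory=dict)
    ops: Dict[str, float] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means ready to run."""
        problems = []
        if self.complex is not True:
            problems.append("complex must be true")
        if not self.question.strip():
            problems.append("question is empty")
        # Assumption: cross-critique needs at least two models.
        if len(self.models) < 2:
            problems.append("need at least two models for cross-critique")
        if len(set(self.models)) != len(self.models):
            problems.append("duplicate agentId in models")
        unknown = sorted(set(self.ops) - ALLOWED_OPS)
        if unknown:
            problems.append(f"unknown ops keys: {unknown}")
        return problems
```

Returning a list of problems rather than raising on the first one lets the orchestrator ask the user to confirm everything that is missing in a single turn.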

File map (what each file does)

- SKILL.md (this file): orchestration policy, trigger conditions, and execution sequence.
- references/prompt-templates.md: reusable prompts for draft, critique, revision, and final synthesis (includes scoring rubric usage).
- references/orchestration-template.md: practical OpenClaw orchestration flow using sessions_spawn, sessions_send, and sessions_history.
- references/output-schema.md: machine-parseable JSON output schema for final result and per-model scoring.
- scripts/build_round_prompts.py: utility to generate per-model prompt files for repeated runs.
- scripts/run_orchestration.py: local helper that builds a run plan JSON (model mapping, round prompts, runtime settings).

Workflow

Step 1) Parallel draft round

Spawn one ACP session per model with the same task and constraints.

Per-model requirements:

- Follow the exact internal sequence: Plan -> Execute -> Review -> Improve
- Print all four sections explicitly
- End with Draft Answer

Use sessions_spawn with runtime:"acp" and explicit agentId.
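The draft round can be prepared as one request per model, each carrying the identical task and constraints. This is only a sketch of the data preparation: runtime "acp" and agentId come from the text above, while the task key, the function name build_spawn_requests, and the overall payload shape are assumptions rather than the documented sessions_spawn interface.

```python
from typing import Dict, List

DRAFT_INSTRUCTIONS = (
    "Follow this sequence and print all four sections explicitly: "
    "Plan -> Execute -> Review -> Improve. End with a Draft Answer."
)


def build_spawn_requests(
    question: str, constraints: str, agent_ids: List[str]
) -> List[Dict[str, str]]:
    """Build one spawn request per model, all sharing the same prompt."""
    prompt = f"{DRAFT_INSTRUCTIONS}\n\nTask: {question}\nConstraints: {constraints}"
    return [
        # "task" is a hypothetical key name; check the real tool schema.
        {"runtime": "acp", "agentId": agent_id, "task": prompt}
        for agent_id in agent_ids
    ]
```

Building the prompt once and reusing it guarantees that every model sees the same task, so differences between drafts come from the models rather than from the prompts.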

Step 2) Cross-critique round

Share peer Draft Answer outputs with each model and require structured critique:

- Strengths
- Weaknesses
- Missing assumptions/data
- Hallucination and confidence risks
- Concrete fix suggestions

Also require ranking of peer drafts with rationale.
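The cross-critique fan-out can be sketched as follows: each model receives every draft except its own, together with the critique headings listed above. The helper names peer_drafts_for and build_critique_prompt are hypothetical, and the canonical prompt wording lives in the skill's prompt templates, not here.

```python
from typing import Dict

CRITIQUE_SECTIONS = [
    "Strengths",
    "Weaknesses",
    "Missing assumptions/data",
    "Hallucination and confidence risks",
    "Concrete fix suggestions",
]


def peer_drafts_for(drafts: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Map each reviewer to the drafts of every other model, never its own."""
    return {
        reviewer: {a: text for a, text in drafts.items() if a != reviewer}
        for reviewer in drafts
    }


def build_critique_prompt(peers: Dict[str, str]) -> str:
    """Assemble a critique request covering all peer drafts."""
    sections = "\n".join(f"- {name}" for name in CRITIQUE_SECTIONS)
    bodies = "\n\n".join(f"[{author}]\n{text}" for author, text in peers.items())
    return (
        "Critique each peer draft under these headings:\n"
        f"{sections}\n"
        "Then rank the peer drafts and give a rationale.\n\n"
        f"{bodies}"
    )
```

Excluding a model's own draft from its review set keeps the critique focused on peers and avoids self-ranking.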

Step 3) Revision round

Send critique feedback back to each original model and request revision:

- Keep Plan -> Execute -> Review -> Improve
- Include Changes from Critique
- End with Revised Answer

Step 4) Final synthesis round

Integrate revised answers into one user-facing output:

- Best final answer
- Why the synthesis is stronger than individual drafts
- Remaining uncertainties
- Optional next actions

Scoring rubric (required in critique + synthesis)

Score each draft on a 1-5 scale:

- accuracy: factual correctness and internal consistency
- coverage: completeness against user request and constraints
- evidence: quality of assumptions and support
- actionability: usefulness for concrete decision/action

Default weighted score:
0.40 * accuracy + 0.25 * coverage + 0.20 * evidence + 0.15 * actionability

Use this score to justify rankings and the final selected direction.
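As a worked illustration of the rubric, the sketch below computes the default weighted score (weights 0.40, 0.25, 0.20, 0.15 for accuracy, coverage, evidence, actionability) and ranks drafts by it. The function names weighted_score and rank_drafts are hypothetical and are not taken from the skill's scripts.

```python
from typing import Dict, List

WEIGHTS = {
    "accuracy": 0.40,
    "coverage": 0.25,
    "evidence": 0.20,
    "actionability": 0.15,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine the four 1-5 rubric scores into one weighted value."""
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be on the 1-5 scale")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank_drafts(all_scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Return model ids ordered from highest to lowest weighted score."""
    return sorted(all_scores, key=lambda m: weighted_score(all_scores[m]), reverse=True)
```

For example, a draft scored accuracy 4, coverage 3, evidence 5, actionability 2 comes out at about 3.65 (1.60 + 0.75 + 1.00 + 0.30). Because the weights sum to 1, the result stays on the same 1-5 scale as the inputs.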

Prompting resources

- Use references/prompt-templates.md for canonical prompts.
- Use scripts/build_round_prompts.py when you need file-based prompt generation for repeated or batched runs.
- Use scripts/run_orchestration.py to generate a deterministic run-plan artifact for reproducible execution.
- Use references/orchestration-template.md for concrete OpenClaw tool-call flow.

Required user-facing output shape

1. Final Answer
2. Key Improvements from Critique
3. Uncertainties
4. Next Steps (optional)

When machine consumption is needed, return JSON matching references/output-schema.md.

Do not expose private chain-of-thought. Provide concise reasoning summaries only.

Failure handling

- One model fails: continue with remaining models and note reduced diversity.
- Two or more models fail: ask whether to retry or switch to single-model mode.
- Strong disagreement remains: present competing hypotheses and state what evidence would resolve them.
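The first two failure rules reduce to a count of failed models in a round. The following is a minimal sketch under assumptions: the function name handle_round_failures and the action labels it returns are hypothetical, and a failed model is represented here as a None output.

```python
from typing import Dict, List, Optional, Tuple


def handle_round_failures(
    results: Dict[str, Optional[str]]
) -> Tuple[str, List[str]]:
    """Decide the next action from per-model outputs (None means failed)."""
    survivors = [model for model, out in results.items() if out is not None]
    failed = len(results) - len(survivors)
    if failed == 0:
        return "continue", survivors
    if failed == 1:
        # Proceed with the rest, but the final output should note this.
        return "continue_note_reduced_diversity", survivors
    # Two or more failures: the user chooses retry or single-model mode.
    return "ask_retry_or_single_model", survivors
```

The disagreement rule is left out on purpose: deciding that strong disagreement remains is a judgment about content, not something a failure count can detect.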

Runtime defaults (recommended)

- timeoutSec: 180 per round per model
- maxRetries: 1 per failed model turn
- maxRounds: fixed at 4 (draft, critique, revision, synthesis)
- budgetUsd: optional hard stop when cost-sensitive
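These defaults can be captured in a small settings object that merges user overrides. This is a sketch under assumptions: the names RuntimeOps, resolve_ops, and over_budget are hypothetical, and rejecting a maxRounds override is one reading of "fixed at 4".

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class RuntimeOps:
    timeoutSec: int = 180  # per round, per model
    maxRetries: int = 1  # per failed model turn
    maxRounds: int = 4  # draft, critique, revision, synthesis
    budgetUsd: Optional[float] = None  # None means no hard stop


def resolve_ops(overrides: Dict[str, Any]) -> RuntimeOps:
    """Merge overrides onto the defaults; unknown keys raise TypeError."""
    if overrides.get("maxRounds", 4) != 4:
        raise ValueError("maxRounds is fixed at 4")
    return replace(RuntimeOps(), **overrides)


def over_budget(spent_usd: float, ops: RuntimeOps) -> bool:
    """True when a budget is set and spending has reached it."""
    return ops.budgetUsd is not None and spent_usd >= ops.budgetUsd
```

Calling resolve_ops({"budgetUsd": 2.5}) keeps the other three defaults and enables the hard stop, while an empty override dict returns the recommended defaults unchanged.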
