Production Model Router

Overview

Use this skill to decide which model tier, workflow shape, and verification strategy should handle a user's request.

The goal is to maximize cost-effectiveness without sacrificing task fit, correctness, or operational reliability.

This skill does not blindly choose the strongest model. It chooses the cheapest safe path that still meets the quality bar for the task.

It may recommend:

- a single low-cost model
- a single balanced model
- a single premium model
- a tool-assisted model workflow
- a staged multi-model pipeline
- a parallel comparison workflow
- a draft-and-review workflow
- a consensus or verifier workflow

Primary objective

For every request, choose the minimum-cost execution path that can still satisfy:

- task quality
- correctness requirements
- latency expectations
- safety or risk constraints
- output format needs
- tool and modality requirements

When to use

Use this skill when you need to decide:

- which model should answer a given user request
- whether a cheap model is enough
- when to escalate to a stronger reasoning model
- when to use one model versus multiple models
- when to use tools instead of relying on pure model reasoning
- how to handle complex calculations, code, multimodal input, long context, or high-risk tasks
- how to balance cost, speed, and answer quality in production

Do not use

Do not use this skill to:

- answer the original business question directly
- fabricate model capabilities without evidence from the environment or configuration
- assume the most expensive model is always the best choice
- route high-risk exact tasks to a cheap model without verification
- rely on pure language generation for exact arithmetic when tools are available

Inputs to collect

Collect or infer the following from the request and system context:

Request characteristics

- task type
- domain
- expected output type
- presence of images, files, tables, code, or long documents
- need for exactness versus approximate usefulness
- whether the request is open-ended or precision-critical

Execution constraints

- budget sensitivity
- latency sensitivity
- quality expectation
- token or context size pressure
- tool availability
- need for citations or traceability
- need for reproducibility

Risk profile

- low-risk
- medium-risk
- high-risk

Failure tolerance

- whether a rough answer is acceptable
- whether the answer must be verified
- whether disagreement between models would be valuable

Task taxonomy

Classify the request into one or more of these categories:

1. Simple generation

- rewrite
- summarization
- formatting
- light translation
- basic brainstorming

2. General reasoning

- explanation
- comparison
- concept mapping
- normal business analysis

3. Deep reasoning

- multi-step planning
- tradeoff analysis
- architecture design
- ambiguous decision support
- chain-dependent reasoning

4. Exact calculation or formal logic

- arithmetic
- financial calculations
- unit conversion
- spreadsheet-like reasoning
- symbolic or step-sensitive math
- combinatorics or logic puzzles where exactness matters

5. Coding and technical execution

- code generation
- debugging
- refactoring
- test generation
- query writing
- infrastructure or API design

6. Long-context synthesis

- large documents
- multiple files
- multi-source comparison
- transcript or contract review

7. Multi-modal tasks

- image understanding
- diagram interpretation
- PDF with layout-heavy content
- video or audio related tasks if supported

8. High-risk tasks

- medical
- legal
- financial decisions
- compliance
- security-sensitive operations
- anything where incorrect advice has material consequences

Core routing principle

Always prefer the cheapest path that can safely succeed.

Apply this order of preference:

1. Cheap single-model path
2. Balanced single-model path
3. Premium single-model path
4. Tool-assisted path
5. Staged multi-model path
6. Parallel multi-model comparison
7. Premium plus verifier or consensus workflow

Do not escalate unless the task characteristics justify it.
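
The preference order above can be read as a first-match search. The following is a minimal sketch, not a definitive implementation: the path identifiers, `PREFERENCE_ORDER`, `pick_path`, and the `is_safe` callback are all hypothetical names introduced for illustration, and deciding whether a path "can safely succeed" is left entirely to the caller.

```python
from typing import Callable

# Hypothetical identifiers for the seven paths, in the order of preference listed above.
PREFERENCE_ORDER = [
    "cheap-single",
    "balanced-single",
    "premium-single",
    "tool-assisted",
    "staged-multi-model",
    "parallel-comparison",
    "premium-plus-verifier",
]


def pick_path(is_safe: Callable[[str], bool]) -> str:
    """Return the first path in preference order that the caller judges safe."""
    for path in PREFERENCE_ORDER:
        if is_safe(path):
            return path
    # Assumption: if nothing qualifies, fall back to the most heavily checked path.
    return PREFERENCE_ORDER[-1]
```

For example, `pick_path(lambda p: p in {"premium-single", "tool-assisted"})` selects `"premium-single"`, because it appears earlier in the order than the tool-assisted path.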

Model tiers

Use abstract capability tiers unless the deployment specifies exact providers.

Economy tier

Use for:

- simple rewriting
- formatting
- low-risk classification
- short summaries
- lightweight extraction
- first-pass triage

Strengths:

- lowest cost
- fast response
- good for straightforward tasks

Weaknesses:

- weaker deep reasoning
- more brittle on ambiguity
- worse on exactness-critical tasks

Balanced tier

Use for:

- everyday product and engineering work
- standard reasoning
- moderate code tasks
- moderate document analysis
- most business and writing tasks

Strengths:

- solid quality-cost tradeoff
- handles most normal production traffic
- reasonable speed and robustness

Weaknesses:

- may still fail on highly ambiguous or exacting tasks
- not always enough for hard reasoning or high-risk requests

Premium tier

Use for:

- deep reasoning
- difficult code and architecture problems
- long-context synthesis with subtle dependencies
- high-value outputs
- high-risk tasks requiring stronger judgment

Strengths:

- strongest reasoning
- better ambiguity handling
- better synthesis quality

Weaknesses:

- highest cost
- often slower
- overkill for simple tasks

Tool-assisted tier

Use when exactness matters more than fluent wording.

Use this path for:

- arithmetic
- deterministic calculations
- spreadsheet operations
- formula application
- structured data transformation
- exact code execution or testing if available
- retrieval-backed factual tasks

Rule:
When a task requires exact numeric correctness, prefer tools plus model orchestration over pure model reasoning.

Decision dimensions

Score the request across these dimensions:

1. Complexity

- low
- medium
- high
- very high

2. Exactness requirement

- low: approximate answer is acceptable
- medium: mostly correct is acceptable
- high: exact result expected
- critical: exact result plus verification required

3. Risk level

- low
- medium
- high

4. Latency priority

- urgent
- normal
- relaxed

5. Budget strategy

- minimize cost
- balanced
- quality-first

6. Context burden

- short
- moderate
- long
- extreme

7. Modality burden

- text only
- image or PDF
- mixed inputs
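
One way to keep these scores consistent is to hold them in a small validated record. This is a sketch under stated assumptions: the `TaskProfile` class, its field names, and the `ALLOWED` table are hypothetical, and only the allowed values themselves come from the seven dimensions above.

```python
from dataclasses import dataclass

# Allowed values per dimension, copied from the lists above.
ALLOWED = {
    "complexity": {"low", "medium", "high", "very high"},
    "exactness": {"low", "medium", "high", "critical"},
    "risk": {"low", "medium", "high"},
    "latency": {"urgent", "normal", "relaxed"},
    "budget": {"minimize cost", "balanced", "quality-first"},
    "context": {"short", "moderate", "long", "extreme"},
    "modality": {"text only", "image or PDF", "mixed inputs"},
}


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical container for one scored request."""

    complexity: str
    exactness: str
    risk: str
    latency: str
    budget: str
    context: str
    modality: str

    def __post_init__(self) -> None:
        # Reject any score that is not one of the documented levels.
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
```

Rejecting unknown values early means a typo such as `risk="hgh"` fails loudly instead of silently bypassing a routing rule.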

Hard routing rules

Apply these rules before any soft optimization.

Exact calculation rule

If the task involves exact arithmetic, formulas, tables, accounting-like operations, unit-sensitive conversions, or step-sensitive logic:

- do not rely on a pure language-only route when tools are available
- prefer tool-assisted execution
- use a balanced or premium model only to interpret the task and explain results
- add a verification step for high-impact numeric outputs

High-risk rule

If the task is high-risk:

- do not use economy-only routing as the final path
- require either premium single-model reasoning with grounding or a model plus verifier workflow
- add citations, checks, or a review pass when possible

Ambiguity rule

If the task is materially ambiguous and the answer quality depends on interpretation:

- use a stronger reasoning tier or a two-stage workflow
- do not finalize on a cheap first-pass answer without clarification or review

Long-context rule

If the input is large or multi-document:

- prefer staged processing
- use extraction or chunk summarization first
- then use a stronger model for synthesis if needed
- avoid sending everything to the strongest model by default if staged reduction is cheaper and safe

Multimodal rule

If the task includes images, diagrams, PDFs with layout dependence, or visual interpretation:

- use a model path that actually supports the required modality
- do not route to a text-only path

Coding rule

For code tasks:

- simple boilerplate or syntax transforms may use balanced or economy tiers
- debugging, architecture, concurrency, performance, or tricky refactors should escalate to balanced or premium tiers
- if execution, linting, tests, or static analysis tools are available, prefer tool-assisted validation
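
Because the hard rules run before any soft optimization, they can be modeled as a function that turns request flags into constraints that later scoring must respect. The sketch below is illustrative only: the `Request` flags, the constraint labels, and `hard_constraints` are hypothetical names, and the coding rule is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Request:
    """Hypothetical flags derived from the request and system context."""

    exact_calculation: bool = False
    high_risk: bool = False
    ambiguous: bool = False
    long_context: bool = False
    needs_vision: bool = False
    tools_available: bool = False


def hard_constraints(req: Request) -> Set[str]:
    """Map request flags to non-negotiable routing constraints."""
    constraints: Set[str] = set()
    if req.exact_calculation:
        # Exact calculation rule: tools when available, otherwise flag the result.
        constraints.add("use-tools" if req.tools_available else "mark-unverified")
    if req.high_risk:
        # High-risk rule: economy may not be the final path; require grounding or a verifier.
        constraints.update({"no-economy-final", "verify-or-ground"})
    if req.ambiguous:
        constraints.add("clarify-or-review")
    if req.long_context:
        constraints.add("prefer-staged")
    if req.needs_vision:
        constraints.add("require-multimodal")
    return constraints
```

Returning a set of constraints rather than a final route keeps the hard rules separate from the cost optimization that follows them.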

Recommended workflows

Choose one of these workflow shapes.

1. Single economy

Use when:

- low complexity
- low risk
- low exactness requirement
- low business impact
- latency and cost matter more than polish

Examples:

- rewrite text
- generate short summaries
- classify intent
- format content

2. Single balanced

Use when:

- the task is typical production traffic
- moderate reasoning is needed
- quality matters but premium is not justified

Examples:

- standard technical Q&A
- ordinary product copy
- moderate coding tasks
- document understanding with limited ambiguity

3. Single premium

Use when:

- the task needs strong reasoning
- the output is strategically important
- ambiguity is high
- long dependency chains matter

Examples:

- system design
- complex debugging
- nuanced tradeoff analysis
- sensitive writing requiring higher judgment

4. Tool-assisted reasoning

Use when:

- exactness matters
- calculations are required
- data must be transformed reliably
- code can be executed or checked
- retrieval is needed for factual grounding

Pattern:

- model interprets the request
- tools compute, retrieve, or validate
- model explains and formats the result
5. Staged pipeline

Use when:

- the request is large, expensive, or decomposable
- cheap preprocessing can reduce downstream cost

Pattern:

1. economy or balanced model for triage or extraction
2. balanced or premium model for synthesis
3. optional verifier pass

Examples:

- long-document analysis
- large support threads
- multi-file engineering review
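
The staged pattern can be sketched with models represented as plain callables. Everything here is an assumption made for illustration: `staged_pipeline`, the `Model` alias, and the three callables stand in for whatever model clients a deployment actually provides.

```python
from typing import Callable, Dict, List, Optional

Model = Callable[[str], str]  # hypothetical: takes a prompt, returns text


def staged_pipeline(
    documents: List[str],
    extract: Model,
    synthesize: Model,
    verify: Optional[Callable[[str], bool]] = None,
) -> Dict[str, object]:
    """Cheap extraction per document, then one synthesis call on the reduced input."""
    # Stage 1: an economy or balanced model reduces each document to key facts.
    facts = [extract(doc) for doc in documents]
    # Stage 2: a balanced or premium model sees only the reduced text.
    answer = synthesize("\n".join(facts))
    # Stage 3 (optional): a verifier pass; None means no verifier was run.
    verified = verify(answer) if verify is not None else None
    return {"facts": facts, "answer": answer, "verified": verified}
```

The cost saving comes from stage 2 receiving the joined facts instead of the full documents, so the stronger model processes far fewer tokens.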

6. Draft and review

Use when:

- low-cost drafting is possible but final quality matters

Pattern:

1. cheaper model drafts
2. stronger model critiques, corrects, or upgrades

Best for:

- writing
- technical explanations
- proposal drafting
- code review style tasks
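
A draft-and-review pass can be sketched as two chained calls. The function name, the `Model` alias, and the wording of the review prompt are hypothetical; a real deployment would substitute its own model clients and prompt.

```python
from typing import Callable, Dict

Model = Callable[[str], str]  # hypothetical: takes a prompt, returns text


def draft_and_review(request: str, draft_model: Model, review_model: Model) -> Dict[str, str]:
    """Let a cheaper model draft, then have a stronger model critique and correct it."""
    draft = draft_model(request)
    review_prompt = (
        "Critique and correct the draft below so it fully answers the request.\n"
        f"Request: {request}\nDraft: {draft}"
    )
    final = review_model(review_prompt)
    # Keep the draft alongside the final text so the improvement can be audited.
    return {"draft": draft, "final": final}
```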

7. Parallel comparison

Use when:

- model disagreement is informative
- solution diversity is valuable
- the task is comparative or open-ended

Pattern:

1. two models produce independent answers
2. a stronger model or rule layer compares and merges

Best for:

- architecture options
- planning alternatives
- ambiguous recommendations
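
The comparison pattern can be sketched as follows. This is an illustration under assumptions: `parallel_compare` and the `merge` callable are hypothetical, exact string equality is a deliberately crude stand-in for a real agreement check, and the calls run sequentially here although they could be issued concurrently.

```python
from typing import Callable, Dict, List, Sequence

Model = Callable[[str], str]  # hypothetical: takes a prompt, returns text


def parallel_compare(
    prompt: str,
    models: Sequence[Model],
    merge: Callable[[List[str]], str],
) -> Dict[str, object]:
    """Collect independent answers, then merge them only when they differ."""
    # Each model sees the same prompt and none sees another model's answer.
    candidates = [model(prompt) for model in models]
    agreed = len(set(candidates)) == 1
    final = candidates[0] if agreed else merge(candidates)
    return {"candidates": candidates, "agreed": agreed, "final": final}
```

Surfacing the `agreed` flag lets the caller treat disagreement as a signal in its own right, for instance by escalating or by showing both options to the user.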

8. Consensus or verifier workflow

Use when:

- correctness matters enough to justify extra cost
- false confidence is dangerous

Pattern:

1. primary model produces answer
2. verifier model checks logic, calculations, or policy fit
3. disagreements trigger escalation or explicit uncertainty

Best for:

- high-risk reasoning
- important financial outputs
- compliance-sensitive content
- high-value technical decisions
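
The verifier pattern can be sketched as a primary call followed by a check whose failure leads to either escalation or an explicit "unverified" status. The names `answer_with_verifier`, `primary`, `verifier`, and `escalate`, along with the status labels, are hypothetical.

```python
from typing import Callable, Dict, Optional


def answer_with_verifier(
    prompt: str,
    primary: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    escalate: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Run the primary model, verify its answer, and react to a failed check."""
    answer = primary(prompt)
    if verifier(prompt, answer):
        return {"answer": answer, "status": "verified"}
    if escalate is not None:
        # Disagreement triggers escalation to a stronger path.
        return {"answer": escalate(prompt), "status": "escalated"}
    # No stronger path available: surface the uncertainty instead of hiding it.
    return {"answer": answer, "status": "unverified"}
```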

Cost-control strategy

Use these strategies to keep spending proportional to the value of the task.

Default strategy

- start cheap when safe
- escalate only on signals of failure risk
- avoid premium for routine tasks
- reuse extracted structure instead of repeating full-context calls

Escalation triggers

Escalate to a stronger model or multi-step workflow when any of these appear:

- multiple dependent reasoning steps
- ambiguous user intent with multiple plausible interpretations
- repeated self-contradiction in draft output
- failure to follow structure or constraints
- long context with subtle dependencies
- code correctness matters beyond surface syntax
- exactness-critical math or finance output
- high-risk domain or high business impact
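
Output-based triggers, such as failure to follow structure or self-contradiction in a draft, can drive a simple start-cheap loop. The sketch below makes several assumptions: the tier names match the abstract tiers in this document, while `run_with_escalation`, the `models` mapping, and the `passes_checks` callback are hypothetical.

```python
from typing import Callable, Dict, List, Optional

TIERS = ["economy", "balanced", "premium"]


def run_with_escalation(
    prompt: str,
    models: Dict[str, Callable[[str], str]],
    passes_checks: Callable[[str], bool],
    start: str = "economy",
) -> Dict[str, object]:
    """Try tiers from `start` upward and stop at the first output that passes."""
    tried: List[str] = []
    output: Optional[str] = None
    for tier in TIERS[TIERS.index(start):]:
        output = models[tier](prompt)
        tried.append(tier)
        if passes_checks(output):
            return {"tier": tier, "output": output, "tried": tried}
    # Every tier failed the checks: return the last output with no accepted tier.
    return {"tier": None, "output": output, "tried": tried}
```

Request-level triggers such as a high-risk domain are better handled by choosing a higher `start` tier up front than by waiting for a cheap attempt to fail.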

De-escalation triggers

Use a cheaper path when:

- the task is mostly formatting or rewriting
- the answer can be approximate
- the task is repetitive and pattern-based
- first-pass triage is enough
- premium capabilities would not materially improve the outcome

Complex calculation policy

When the request includes complex calculations or formal reasoning:

1. Separate interpretation from computation.
2. Use the model to parse the problem and define the method.
3. Use a deterministic tool or calculation path when available.
4. Ask a verifier layer to check assumptions, formulas, units, and edge cases for high-impact outputs.
5. Present the final answer with explicit assumptions and, when relevant, step order.

Never use a fluent but non-verified freeform model answer as the final authority for exact numeric work when a deterministic path exists.
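
As an illustration of separating computation from interpretation, the sketch below computes a cash runway with Python's `decimal` module and then checks the result independently. The formula (runway = cash / monthly burn), the function names, and the figures in the usage note are assumptions chosen for the example, not values from any real scenario.

```python
from decimal import Decimal, ROUND_HALF_UP


def runway_months(cash: Decimal, monthly_burn: Decimal) -> Decimal:
    """Deterministic computation: months of runway, rounded to one decimal place."""
    if monthly_burn <= 0:
        raise ValueError("monthly_burn must be positive")
    return (cash / monthly_burn).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def verify_runway(cash: Decimal, monthly_burn: Decimal, runway: Decimal) -> bool:
    """Verifier step: multiplying back must reproduce cash within rounding error."""
    # Rounding to 0.1 months can shift the result by at most 0.05 months of burn.
    tolerance = Decimal("0.05") * monthly_burn
    return abs(runway * monthly_burn - cash) <= tolerance
```

With the hypothetical inputs `cash = Decimal("1200000")` and `monthly_burn = Decimal("85000")`, `runway_months` returns `Decimal("14.1")` and `verify_runway` accepts it. In this arrangement the model only extracts the inputs and explains the result; it never performs the division itself.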

Long-context policy

When the request includes large context:

- first extract relevant segments, summaries, or structured facts
- reduce duplication
- preserve citations or pointers when possible
- synthesize only after reduction
- use premium synthesis only if the reduced problem still demands it
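
The reduction steps can be sketched as a filter that keeps relevant chunks, drops duplicates, and records where each kept chunk came from. Keyword matching is a deliberately simple stand-in for a real relevance step, and `reduce_context` and its parameters are hypothetical names.

```python
from typing import Dict, List, Set


def reduce_context(chunks: List[str], keywords: Set[str]) -> List[Dict[str, object]]:
    """Keep relevant, non-duplicate chunks along with a pointer to their source."""
    seen: Set[str] = set()
    kept: List[Dict[str, object]] = []
    for index, chunk in enumerate(chunks):
        text = " ".join(chunk.split())  # normalize whitespace
        key = text.lower()
        if key in seen:
            continue  # reduce duplication
        # Assumption: keywords are lowercase and matched as plain substrings.
        if any(word in key for word in keywords):
            seen.add(key)
            # Preserve a pointer so the synthesis step can cite the original chunk.
            kept.append({"source_index": index, "text": text})
    return kept
```

Only the returned list needs to reach the synthesis model, and the `source_index` values keep the final answer traceable to the original input.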

Output format

Return exactly this structure:

Routing Decision:

Primary Reason:

Task Profile:

- taskType:
- complexity:
- exactness:
- risk:
- latency:
- budget:
- contextLoad:
- modality:

Recommended Execution Plan:

Model Role Assignment:

- planner:
- generator:
- verifier:

Why This Is Cost-Effective:

Escalation Rule:

Fallback Rule:
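
A small renderer can help guarantee that every response follows this structure exactly. The section titles and profile keys below come from the template above; the `render_decision` function and the dictionary keys it reads (`routing_decision`, `task_profile`, `plan`, and so on) are hypothetical.

```python
from typing import Dict, List

PROFILE_KEYS = ["taskType", "complexity", "exactness", "risk",
                "latency", "budget", "contextLoad", "modality"]
ROLE_KEYS = ["planner", "generator", "verifier"]


def render_decision(d: Dict[str, object]) -> str:
    """Render a routing decision in the fixed section order; missing keys raise KeyError."""
    profile, roles = d["task_profile"], d["roles"]
    lines: List[str] = []
    lines += ["Routing Decision:", str(d["routing_decision"]), ""]
    lines += ["Primary Reason:", str(d["primary_reason"]), ""]
    lines += ["Task Profile:", ""] + [f"- {k}: {profile[k]}" for k in PROFILE_KEYS] + [""]
    lines += ["Recommended Execution Plan:", ""]
    lines += [f"{i}. {step}" for i, step in enumerate(d["plan"], 1)] + [""]
    lines += ["Model Role Assignment:", ""] + [f"- {k}: {roles[k]}" for k in ROLE_KEYS] + [""]
    lines += ["Why This Is Cost-Effective:", ""] + [f"- {reason}" for reason in d["why"]] + [""]
    lines += ["Escalation Rule:", str(d["escalation_rule"]), ""]
    lines += ["Fallback Rule:", str(d["fallback_rule"])]
    return "\n".join(lines)
```

Failing with a `KeyError` on a missing field is intentional in this sketch: an incomplete decision is caught before it is returned to the caller.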

Examples

Example 1: Simple rewrite

Request: "Rewrite this email to sound more professional."

Output:
Routing Decision:
single-economy

Primary Reason:
This is a low-risk rewriting task where a low-cost model should meet the quality bar.

Task Profile:

- taskType: simple generation
- complexity: low
- exactness: low
- risk: low
- latency: urgent
- budget: minimize-cost
- contextLoad: short
- modality: text

Recommended Execution Plan:

1. Use an economy model to rewrite the email.
2. Preserve intent and tone constraints.
3. Return one polished version and one shorter variant.

Model Role Assignment:

- planner: not-needed
- generator: economy
- verifier: none

Why This Is Cost-Effective:

- The task is structurally simple.
- Premium reasoning is unnecessary.
- Fast low-cost generation is sufficient.

Escalation Rule:
Escalate only if the user requests highly nuanced persuasion, legal sensitivity, or multiple tone variants.

Fallback Rule:
If the first rewrite is too generic, retry once with stricter tone instructions using a balanced model.

Example 2: Complex system design

Request: "Design an AI agent architecture for long-form writing with memory, retrieval, conflict resolution, and chapter planning."

Output:
Routing Decision:
single-premium

Primary Reason:
This task requires deep reasoning, abstraction, and multi-part architecture tradeoff analysis.

Task Profile:

- taskType: deep reasoning
- complexity: high
- exactness: medium
- risk: medium
- latency: normal
- budget: balanced
- contextLoad: moderate
- modality: text

Recommended Execution Plan:

1. Use a premium model to identify major architecture components and constraints.
2. Structure the answer into memory layers, retrieval flow, consistency controls, and orchestration.
3. Include tradeoffs and failure modes.
4. Add implementation recommendations only after the design is coherent.

Model Role Assignment:

- planner: premium
- generator: premium
- verifier: none

Why This Is Cost-Effective:

- A weaker model may produce shallow or inconsistent architecture.
- One premium pass is cheaper than multiple failed lower-tier retries.
- The task value justifies stronger reasoning.

Escalation Rule:
Escalate to parallel-comparison only if multiple competing architecture strategies are explicitly required.

Fallback Rule:
If the first answer is too generic, rerun with a staged outline-first then full-design workflow.

Example 3: Financial calculation with policy impact

Request: "Calculate the projected monthly burn and runway under three hiring scenarios and explain the assumptions."

Output:
Routing Decision:
tool-assisted

Primary Reason:
This request depends on exact calculations, so deterministic computation is safer and more cost-effective than pure model reasoning.

Task Profile:

- taskType: exact calculation or formal logic
- complexity: high
- exactness: critical
- risk: high
- latency: normal
- budget: balanced
- contextLoad: moderate
- modality: text

Recommended Execution Plan:

1. Use a balanced model to extract the variables, hiring scenarios, and required formulas.
2. Use a deterministic calculator or spreadsheet-capable path to compute burn and runway.
3. Run a verification check on formulas, assumptions, and units.
4. Use the model to present the results clearly with assumptions and scenario breakdowns.

Model Role Assignment:

- planner: balanced
- generator: tool-assisted
- verifier: tool

Why This Is Cost-Effective:

- Exact arithmetic should not depend on freeform model reasoning.
- Deterministic calculation reduces error risk.
- A smaller model can still explain the results after computation.

Escalation Rule:
Escalate to premium review only if the scenario assumptions are ambiguous or strategically sensitive.

Fallback Rule:
If tool-based computation is unavailable, return assumptions explicitly and mark the result as unverified instead of presenting certainty.

Example 4: Multi-document review

Request: "Compare these five candidate proposals and recommend the best one based on cost, timeline, technical fit, and risk."

Output:
Routing Decision:
staged-pipeline

Primary Reason:
The request is large and decomposable, so staged extraction and synthesis is more cost-effective than sending everything directly to a premium model.

Task Profile:

- taskType: long-context synthesis
- complexity: high
- exactness: medium
- risk: medium
- latency: relaxed
- budget: balanced
- contextLoad: long
- modality: text

Recommended Execution Plan:

1. Use an economy or balanced model to extract structured facts from each proposal.
2. Normalize the proposals into a common comparison table.
3. Use a stronger model to synthesize tradeoffs and recommend the best option.
4. Add a brief verifier pass if the recommendation is high stakes.

Model Role Assignment:

- planner: balanced
- generator: staged-pipeline
- verifier: balanced

Why This Is Cost-Effective:

- Cheap extraction lowers total token cost.
- Structured normalization improves synthesis quality.
- Premium reasoning is reserved for the part that truly needs it.

Escalation Rule:
Escalate to consensus-check if the recommendation will drive a major decision or if proposal differences are subtle.

Fallback Rule:
If extraction quality is poor, rerun the extraction stage with a stronger model before recomputing the final recommendation.

production-model-router: production model routing