AI Orchestration Skill — Multi-Agent Systems
Design, coordinate, and evaluate multi-agent AI systems. This skill covers agent architecture, prompt engineering patterns, eval-driven development, and context management for AI agents.
When to Use This Skill
Explicit Triggers
- "Design multi-agent system"
- "Orchestrate AI agents"
- "Write prompt for agent"
- "Build eval framework"
- "Coordinate parallel AI tasks"
Implicit Detection
- Complex task requiring specialization
- Multiple independent subtasks
- Need for agent communication
- Evaluating AI output quality
- Managing context across agents
Multi-Agent Architecture
Agent Decomposition Pattern
Break complex tasks into specialized agents:
CODEBLOCK0
Principles:
- Each agent has a single responsibility
- Orchestrator manages high-level state
- Agents communicate via structured output
- Sub-agents receive focused, complete instructions
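The principles above could be sketched as follows. This is a minimal illustration, not the skill's own implementation: `AgentResult`, `Orchestrator`, `planner`, and `reviewer` are hypothetical names, and the two agent functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentResult:
    """Structured output that every agent returns to the orchestrator."""
    agent: str
    status: str  # "ok" or "error"
    summary: str


def planner(instructions: str) -> AgentResult:
    # Hypothetical stand-in for a model call that drafts a plan.
    return AgentResult("planner", "ok", f"plan for: {instructions}")


def reviewer(instructions: str) -> AgentResult:
    # Hypothetical stand-in for a model call that reviews the work.
    return AgentResult("reviewer", "ok", f"review of: {instructions}")


class Orchestrator:
    """Holds high-level state only; detailed work lives in the sub-agents."""

    def __init__(self, agents: Dict[str, Callable[[str], AgentResult]]):
        self.agents = agents
        self.history: List[AgentResult] = []

    def delegate(self, agent_name: str, instructions: str) -> AgentResult:
        # Each sub-agent receives only its own focused instructions.
        result = self.agents[agent_name](instructions)
        self.history.append(result)
        return result


orchestrator = Orchestrator({"planner": planner, "reviewer": reviewer})
plan = orchestrator.delegate("planner", "add OAuth login")
```

Keeping each agent behind a plain callable makes it easy to swap a stub for a real model client later.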
Parallel vs Sequential
CODEBLOCK1
Decision Guide:
- Use parallel when tasks are independent
- Use sequential when output of one task is input to next
- Mix patterns for complex workflows
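One way to express the decision guide in code is with `asyncio`. The sketch below assumes agents are asynchronous callables; `run_agent` is a hypothetical stand-in for a real model call, and the agent names are illustrative.

```python
import asyncio
from typing import List


async def run_agent(name: str, payload: str) -> str:
    # Hypothetical stand-in for an asynchronous model call.
    await asyncio.sleep(0)
    return f"{name}({payload})"


async def run_parallel(names: List[str], payload: str) -> List[str]:
    # Independent tasks: start them together and wait for all results.
    return await asyncio.gather(*(run_agent(n, payload) for n in names))


async def run_sequential(names: List[str], payload: str) -> str:
    # Dependent tasks: each agent consumes the previous agent's output.
    for name in names:
        payload = await run_agent(name, payload)
    return payload


reviews = asyncio.run(run_parallel(["security", "performance"], "diff"))
pipeline = asyncio.run(run_sequential(["plan", "implement", "test"], "spec"))
```

A mixed workflow can call `run_parallel` as one step inside a sequential pipeline.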
Context Isolation Strategy
CODEBLOCK2
Benefits:
- Each agent stays focused
- Context windows remain manageable
- Easier to debug individual agents
- Can parallelize independent agents
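The isolation described above might look like this in practice. It is a sketch under stated assumptions: `OrchestratorState`, `build_subagent_context`, and `record_summary` are hypothetical helpers, and the context is plain text for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OrchestratorState:
    goal: str
    # Summaries returned by sub-agents; kept in the main context only.
    summaries: Dict[str, str] = field(default_factory=dict)


def build_subagent_context(state: OrchestratorState, instructions: str,
                           files: List[str]) -> str:
    # The sub-agent sees the goal, its own instructions, and its own files.
    # Summaries from sibling agents are deliberately left out.
    file_list = "\n".join(f"- {path}" for path in files)
    return f"Goal: {state.goal}\nInstructions: {instructions}\nFiles:\n{file_list}"


def record_summary(state: OrchestratorState, agent: str, summary: str) -> None:
    # Only the structured summary flows back to the orchestrator.
    state.summaries[agent] = summary


state = OrchestratorState("ship OAuth login")
record_summary(state, "planner", "three-step plan")
context = build_subagent_context(state, "write tests", ["auth.py"])
```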
Prompt Engineering Patterns
Role-Task-Context-Format (RTCF)
CODEBLOCK3
Chain of Thought (CoT)
CODEBLOCK4
Few-Shot Learning
CODEBLOCK5
Structured Output
CODEBLOCK6
Eval-Driven Development (EDD)
Define Evals Before Building
CODEBLOCK7
Evaluation Criteria
| Criteria | Weight | How to Measure |
|---|---|---|
| Correctness | 40% | Output matches expected result |
| Completeness | 30% | All required elements present |
| Safety | 15% | No harmful/biased content |
| Format | 10% | Follows requested structure |
| Conciseness | 5% | Token count efficiency |
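The weights in the table can be combined into a single score. The sketch below assumes each criterion has already been scored on a 0 to 1 scale by some other means; `WEIGHTS` and `weighted_score` are illustrative names.

```python
from typing import Dict

# Weights copied from the table above; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "correctness": 0.40,
    "completeness": 0.30,
    "safety": 0.15,
    "format": 0.10,
    "conciseness": 0.05,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-criterion scores in [0, 1] into one weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Raising on a missing criterion avoids silently treating an unmeasured dimension as zero.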
A/B Testing Prompts
CODEBLOCK8
Use Cases
Use Case 1: Code Review Multi-Agent System
Scenario: Automated security and quality review for pull requests
Agent Architecture:
CODEBLOCK9
Prompt for Security Reviewer:
CODEBLOCK10
Orchestrator Logic:
CODEBLOCK11
Use Case 2: Feature Implementation Workflow
Scenario: Implement OAuth authentication for WordPress plugin
Agent Workflow (Sequential):
Step 1: Planner Agent
CODEBLOCK12
Output:
CODEBLOCK13
Step 2: Implementer Agent
CODEBLOCK14
Step 3: Tester Agent
CODEBLOCK15
Step 4: Reviewer Agent
CODEBLOCK16
Use Case 3: Parallel Content Generation
Scenario: Generate documentation, tests, and examples for new API
Agent Workflow (Parallel):
CODEBLOCK17
All agents run simultaneously:
Doc Generator:
CODEBLOCK18
Test Generator:
CODEBLOCK19
Example Generator:
CODEBLOCK20
Orchestrator aggregates:
CODEBLOCK21
Agent Communication Patterns
Structured Handoff
CODEBLOCK22
Error Recovery
CODEBLOCK23
Context Window Management
What to Include
CODEBLOCK24
Progressive Context Loading
CODEBLOCK25
Context Budget Allocation
CODEBLOCK26
Anti-Patterns
God Agent
Problem: One agent doing everything
- Too many responsibilities
- Loses focus and quality
- Context window overload
Solution: Split into specialized agents
Blind Delegation
Problem: Launching agents without clear success criteria
- Agents may produce wrong output
- No way to verify quality
- Wastes tokens and time
Solution: Define success criteria before delegation
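Success criteria can be written down as executable checks before any agent is launched. The following is a minimal sketch; `Criterion` and `verify` are hypothetical names, and the two sample criteria are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]


def verify(output: str, criteria: List[Criterion]) -> Tuple[bool, List[str]]:
    """Return (passed, names of failed criteria) for one agent output."""
    failed = [c.name for c in criteria if not c.check(output)]
    return (not failed, failed)


# Defined before delegation, so the orchestrator knows what "done" means.
criteria = [
    Criterion("mentions severity", lambda out: "severity" in out.lower()),
    Criterion("under 500 chars", lambda out: len(out) < 500),
]
```

The list of failed names gives the orchestrator something concrete to feed back to the agent.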
Context Overload
Problem: Stuffing entire codebase into prompt
- Hits token limits quickly
- Agent gets confused
- Slow and expensive
Solution: Progressive context loading, summarize first
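Progressive loading could be approximated with a token budget: summaries go in first, then full files while room remains. This sketch assumes a rough ratio of four characters per token, which is not an exact tokenizer; `estimate_tokens` and `load_context` are hypothetical helpers.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def load_context(summaries: Dict[str, str], full_text: Dict[str, str],
                 priority: List[str], budget: int) -> List[str]:
    """Add summaries first, then full files in priority order while budget remains."""
    parts: List[str] = []
    used = 0
    # Pass 1: cheap summaries give the agent a map of the codebase.
    for path, summary in summaries.items():
        part = f"[summary] {path}: {summary}"
        cost = estimate_tokens(part)
        if used + cost > budget:
            break
        parts.append(part)
        used += cost
    # Pass 2: full text, most relevant first, only if it still fits.
    for path in priority:
        part = f"[full] {path}\n{full_text[path]}"
        cost = estimate_tokens(part)
        if used + cost > budget:
            continue  # Skip this file; a smaller one may still fit.
        parts.append(part)
        used += cost
    return parts
```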
Eval-Free Development
Problem: Shipping prompts without measuring quality
- Don't know what works
- Can't improve prompts
- Risk of poor performance
Solution: Eval-driven development, measure everything
Retry Loops
Problem: Retrying same failing approach without adjustment
- Infinite loops possible
- Wastes tokens
- No progress
Solution: Adjust approach after each failure
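A bounded retry that changes strategy on each attempt might look like the sketch below. The strategy functions `direct` and `with_examples` are hypothetical placeholders for real prompt variants.

```python
from typing import Callable, List, Optional


def run_with_adjustments(
    task: str, strategies: List[Callable[[str], Optional[str]]]
) -> Optional[str]:
    """Try each strategy once, in order, and stop at the first success.

    Attempts are bounded by the number of strategies, so the loop cannot
    run forever, and no failing approach is repeated unchanged.
    """
    for attempt, strategy in enumerate(strategies, start=1):
        result = strategy(task)
        if result is not None:
            return f"attempt {attempt}: {result}"
    return None  # All strategies failed; escalate instead of retrying.


def direct(task: str) -> Optional[str]:
    # Hypothetical: the plain prompt fails for this task.
    return None


def with_examples(task: str) -> Optional[str]:
    # Hypothetical: adding few-shot examples succeeds.
    return f"solved {task} using few-shot examples"
```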
Integration Points
- skill-manager: Manage agent configurations
- verification-loop: Validate agent outputs
- continuous-learning-v2: Learn from agent interactions
- strategic-compact: Manage context windows
Best Practices
1. Define success criteria before building agents
2. Start with one agent, split only when needed
3. Measure everything (latency, cost, accuracy)
4. Version your prompts like you version code
5. Use structured output (JSON) for agent-to-agent communication
6. Add guardrails (output validation, content filtering)
7. Log all interactions for debugging and improvement
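The structured-output and guardrail practices can be combined in a small validator that rejects malformed handoffs. This is a sketch only: the three-field schema in `REQUIRED_FIELDS` is an assumption made for illustration, not a format defined by this skill.

```python
import json
from typing import Any, Dict

# Hypothetical handoff schema: field name -> required Python type.
REQUIRED_FIELDS = {"agent": str, "status": str, "summary": str}


def parse_handoff(raw: str) -> Dict[str, Any]:
    """Parse an agent's JSON reply and reject it if required fields are wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} missing or not {expected_type.__name__}")
    return data
```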
Quick Reference
CODEBLOCK27
Related Skills
- skill-manager: Agent orchestration and management
- verification-loop: Output validation
- continuous-learning-v2: Pattern extraction
- strategic-compact: Context management
Remember: Good orchestration starts with clear agent responsibilities, structured communication, and measurable success criteria.
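AI Orchestration Skill: Multi-Agent Systems
Design, coordinate, and evaluate multi-agent AI systems. This skill covers agent architecture, prompt engineering patterns, eval-driven development, and context management for AI agents.
When to Use This Skill
Explicit Triggers
- Design multi-agent system
- Orchestrate AI agents
- Write prompt for agent
- Build eval framework
- Coordinate parallel AI tasks
Implicit Detection
- Complex task requiring specialization
- Multiple independent subtasks
- Need for agent communication
- Evaluating AI output quality
- Managing context across agents
Multi-Agent Architecture
Agent Decomposition Pattern
Break complex tasks into specialized agents: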
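```
Orchestrator (main context)
├── Planner Agent → design the approach, identify files
├── Implementer Agent → write code according to the plan
├── Reviewer Agent → review code quality/security
├── Tester Agent → write and run tests
└── Docs Agent → update documentation and README
```
Principles:
- Each agent has a single responsibility
- Orchestrator manages high-level state
- Agents communicate via structured output
- Sub-agents receive focused, complete instructions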
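Parallel vs Sequential
```
Parallel (independent tasks):
- Security review + performance review + type checking
- Multiple file searches across different directories
- Independent repository enhancements

Sequential (dependent tasks):
- Plan → implement → test → review
- Read file → edit file → verify edit
- Clone repo → create branch → modify → push → create PR
```
Decision Guide:
- Use parallel when tasks are independent
- Use sequential when the output of one task is the input to the next
- Mix both patterns for complex workflows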
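Context Isolation Strategy
```
Main context (orchestrator):
- Maintains high-level state and progress
- Delegates detailed work to sub-agents
- Aggregates results from sub-agents

Sub-agent context:
- Receives focused, complete instructions
- Has access to tools but limited context
- Returns a structured summary to the orchestrator
- Does not see other sub-agents' output
```
Benefits:
- Each agent stays focused
- Context windows remain manageable
- Easier to debug individual agents
- Can parallelize independent agents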
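Prompt Engineering Patterns
Role-Task-Context-Format (RTCF)
```
Role: You are a senior security engineer with 10 years of experience,
specializing in authentication systems and OAuth implementations.

Task: Review this code for vulnerabilities against the OWASP Top 10.

Context: This is an Express.js API that handles user authentication
with JWT tokens. The API serves 10,000+ users per day.

Format: For each issue, provide:
- Severity: Critical|High|Medium|Low
- File: line number
- Description
- Recommended fix
- CVSS score (if applicable)

Begin the review now:
```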
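Chain of Thought (CoT)
```
Think step by step:
1. First, identify what the code does
2. Then, check input validation
3. Next, trace the data flow from input to output
4. Finally, identify any point where unsanitized data reaches a sensitive operation

Apply this to the following code:
```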
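Few-Shot Learning
```
Here are examples of good commit messages:

feat(dashboard): add threat severity chart
- Show threat levels by category
- Interactive filtering by severity
- Link to detailed threat reports

fix(api): handle proxy server timeout
- Add connection timeout (30 seconds)
- Implement retry logic (3 attempts)
- Add circuit breaker pattern

security: add CSP headers to Express middleware
- Add Content-Security-Policy header
- Allow same-origin scripts only
- Block inline script execution

Now write a commit message for these changes:
[git diff output]
```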
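Structured Output
```
Return the security analysis as JSON:

{
  "summary": "Total: 5 issues (1 critical, 2 high, 2 medium)",
  "issues": [
    {
      "severity": "critical",
      "category": "injection",
      "file": "routes/auth.js",
      "line": 42,
      "description": "SQL injection vulnerability in the login query",
      "fix": "Use parameterized queries",
      "cve_potential": true
    }
  ],
  "recommendations": [
    "Implement input validation middleware",
    "Add rate limiting to authentication endpoints",
    "Use prepared statements for all queries"
  ]
}

Do not include any text outside the JSON.
```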
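Eval-Driven Development (EDD)
Define Evals Before Building
```python
# eval_suite.py
from dataclasses import dataclass
from typing import List


def call_model(model: str, prompt: str) -> str:
    """Placeholder: replace with a call to your model provider's client."""
    raise NotImplementedError("wire call_model to your model API")


@dataclass
class EvalCase:
    input: str
    expected: str
    criteria: List[str]


@dataclass
class EvalResult:
    score: float
    passed: bool
    feedback: str


class PromptEval:
    def __init__(self, prompt_template: str):
        self.template = prompt_template
        self.test_cases: List[EvalCase] = []

    def add_case(self, input: str, expected: str, criteria: List[str]):
        """Add a test case for the prompt."""
        self.test_cases.append(EvalCase(input, expected, criteria))

    def run(self, model: str) -> dict:
        """Run all test cases and return the results."""
        results = []
        for case in self.test_cases:
            # Format the prompt with the input
            formatted_prompt = self.template.format(input=case.input)
            # Call the model
            output = call_model(model, formatted_prompt)
            # Evaluate the output
            score = self._evaluate(output, case.expected, case.criteria)
            results.append({
                "input": case.input,
                "output": output,
                "expected": case.expected,
                "score": score,
            })
        # Guard against an empty suite to avoid dividing by zero
        avg = sum(r["score"] for r in results) / len(results) if results else 0.0
        return {
            "prompt": self.template,
            "model": model,
            "results": results,
            "avg_score": avg,
        }

    def _evaluate(self, output: str, expected: str, criteria: List[str]) -> float:
        """Score the output against the expected result and the criteria."""
        score = 0.0
        # Correctness (40%)
        if expected.lower() in output.lower():
            score += 0.4
        # Completeness (30%), split evenly across however many criteria exist
        if criteria:
            matched = sum(1 for c in criteria if c.lower() in output.lower())
            score += 0.3 * matched / len(criteria)
        # Format/structure (30%)
        if self._is_well_formatted(output):
            score += 0.3
        return min(score, 1.0)

    def _is_well_formatted(self, output: str) -> bool:
        """Check whether the output follows the expected structure."""
        # Simple format check: at least three lines
        return len(output.split("\n")) >= 3


# Usage (requires call_model to be implemented first)
if __name__ == "__main__":
    eval_suite = PromptEval(
        "Summarize this article: {input}\n\nProvide 3 bullet points."
    )
    eval_suite.add_case(
        input="AI is transforming healthcare...",
        expected="AI applications in healthcare",
        criteria=["machine learning", "diagnosis", "treatment"],
    )
    eval_suite.add_case(
        input="Climate change impacts...",
        expected="climate change",
        criteria=["rising temperatures", "extreme weather", "solutions"],
    )
    results = eval_suite.run("claude-sonnet-4")
    print(f"Average score: {results['avg_score']:.2f}")
```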
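Evaluation Criteria
| Criteria | Weight | How to Measure |
|---|---|---|
| Correctness | 40% | Output matches expected result |
| Completeness | 30% | All required elements present |
| Safety | 15% | No harmful/biased content |
| Format | 10% | Follows requested structure |
| Conciseness | 5% | Token count efficiency |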
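A/B Testing Prompts
```python
# The previous block is labeled eval_suite.py, so the import uses that module name
from eval_suite import EvalCase, PromptEval

# Prompt A: direct instruction ({input} added so the article reaches the model)
prompt_a = "Summarize this article in 3 bullet points.\n\nArticle: {input}"

# Prompt B: structured prompt with examples
prompt_b = """Extract the 3 most important facts from this article.
Format them as bullet points that start with an action verb.

Examples:
✓ Implemented feature X to solve Y
✓ Refactored module Z for better performance
✓ Added tests covering edge cases

Article: {input}"""

# Shared test cases (this one reuses the example from eval_suite.py)
test_cases = [
    EvalCase(
        input="AI is transforming healthcare...",
        expected="AI applications in healthcare",
        criteria=["machine learning", "diagnosis", "treatment"],
    ),
]

# Run the evaluation on both prompts
eval_a = PromptEval(prompt_a)
eval_b = PromptEval(prompt_b)

# Add the same test cases to both
for case in test_cases:
    eval_a.add_case(case.input, case.expected, case.criteria)
    eval_b.add_case(case.input, case.expected, case.criteria)

# Compare results
results_a = eval_a.run("claude-sonnet-4")
results_b = eval_b
```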