Production Agent Design
Core Principle
The LLM is the reasoning engine. Your code is the execution engine. The loop is the contract between them.
Every production concern — safety, cost, retries, logging, permissions — lives in the harness, not the prompt. A prompt that says "be careful with deletions" is a suggestion. A GuardedToolNode that intercepts delete_* calls is a guarantee.
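To make the contrast concrete, here is a minimal, stdlib-only sketch of a harness-level guard. It is not the `GuardedToolNode` named above, whose real interface lives in the reference files. `Rule`, `guarded_call`, the action names, and the toy tools are all hypothetical stand-ins, and the default-deny policy is an assumption.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Any, Callable


@dataclass(frozen=True)
class Rule:
    pattern: str  # glob matched against the tool name, e.g. "delete_*"
    action: str   # "allow", "deny", or "ask" (hand off to a human)


def guarded_call(
    name: str,
    args: dict[str, Any],
    tools: dict[str, Callable[..., Any]],
    rules: list[Rule],
) -> dict[str, Any]:
    """Run a tool only if the first matching rule allows it."""
    # First match wins; a tool no rule mentions is denied (assumed policy).
    action = next((r.action for r in rules if fnmatch(name, r.pattern)), "deny")
    if action != "allow":
        # The tool function is never invoked on this path.
        return {"status": action, "tool": name, "result": None}
    return {"status": "ok", "tool": name, "result": tools[name](**args)}


# Hypothetical tools: one harmless, one destructive.
deleted: list[str] = []
TOOLS = {
    "read_file": lambda path: f"contents of {path}",
    "delete_file": lambda path: deleted.append(path),
}
RULES = [Rule("delete_*", "ask"), Rule("read_*", "allow")]

blocked = guarded_call("delete_file", {"path": "/data/prod.db"}, TOOLS, RULES)
allowed = guarded_call("read_file", {"path": "notes.txt"}, TOOLS, RULES)
```

Whatever the model was told in its prompt, `delete_file` cannot run here until some other code path turns the "ask" result into an approval.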
When to Use This Skill
- Designing a new multi-agent system from scratch
- Adding safety, cost controls, or observability to an existing agent
- Debugging runaway cost, infinite loops, or context window exhaustion
- Choosing between single-agent vs multi-agent topology
- Implementing human-in-the-loop (HITL) for irreversible actions
- Setting up session persistence and resumption
Architecture at a Glance
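
    Entry point (HTTP / CLI / Webhook / Scheduler)
        │
    Router layer: classify intent, dispatch cheaply
        │
    Orchestrator: decompose tasks, delegate to specialists
        ├── Agent A (scoped tools)
        └── Agent B (scoped tools)
        │
    Tool layer: validate schema → check permissions → execute → truncate
        │
    Cross-cutting concerns
        ├── Memory (short-term / working / long-term)
        ├── Observability (tracing, cost, session replay)
        └── Resilience (retries, circuit breakers, loop guards)
        │
    Persistence: checkpoints (Redis / Postgres) + audit log
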
Single Agent vs Multi-Agent
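
    Task confined to a single domain?
      Yes → single ReAct agent with the appropriate tools
      No  → independent subtasks?
              Yes → parallel multi-agent (supervisor + specialists)
              No  → sequential / hierarchical orchestrator
                      │
              Any irreversible steps that need human review?
                Yes → plan-then-execute with HITL interrupt
                No  → orchestrator with automatic delegation
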
Rule: Start with a single agent. Add multi-agent complexity only when you hit a concrete limit — context window size, tool set sprawl, latency, or accuracy.
Framework Selection
| Need | Use |
|---|---|
| Complex branching, HITL, durable persistence, fine-grained control | LangGraph |
| Simple loop, minimal boilerplate, rapid prototype, leaf agents | Strands |
| Orchestration graph + simple leaf agents | LangGraph + Strands hybrid |
Reference Files
Load these on demand using the triggers listed below. Do not load all of them upfront.
| File | Trigger |
|---|---|
| references/orchestrator-layer.md | Decomposing tasks, spawning subagents, implementing plan-then-execute |
| references/tool-safety-layer.md | Designing tools, adding permission rules, implementing HITL or killswitch |
| references/memory-layer.md | Context window approaching limit, adding long-term memory, injecting project context |
| references/observability-layer.md | Adding tracing, tracking token cost, debugging agent behavior, setting up alerts |
| references/resilience-layer.md | Adding retry logic, circuit breakers, preventing infinite loops |
| references/persistence-layer.md | Choosing a checkpointer, implementing session resume, session branching |
| references/production-checklist.md | Before deploying to production: full ~40-point readiness checklist |
Quick Reference
| Pattern | Key implementation | Reference |
|---|---|---|
| Intent routing | `conditional_edges` + confidence threshold | router-layer.md |
| Scoped subagents | `create_react_agent` with tool subset | orchestrator-layer.md |
| Plan-then-execute | Two nodes, read-only tools in plan phase | orchestrator-layer.md |
| Tool schema | `args_schema=PydanticModel` on `@tool` | tool-safety-layer.md |
| Permission guard | `GuardedToolNode` with `PermissionRule` list | tool-safety-layer.md |
| HITL interrupt | `interrupt()` + `Command(resume=...)` | tool-safety-layer.md |
| Runtime concurrency | `is_concurrency_safe(input)` per tool call | tool-safety-layer.md |
| Abort hierarchy | Query-level abort + sibling-level child abort | tool-safety-layer.md |
| Tiered compaction | budget → snip → microcompact → autocompact | memory-layer.md |
| Auto-compaction | Summarization node at 80% context | memory-layer.md |
| Context injection | `AGENT.md` loaded into system prompt | memory-layer.md |
| Full trace | `BaseCallbackHandler` + structured events | observability-layer.md |
| Cost tracking | Per-turn token accounting in callback | observability-layer.md |
| Config snapshot | Freeze all feature flags at query entry | observability-layer.md |
| Diminishing returns | Track token deltas; stop when the last 2 deltas are each below ~500 tokens | resilience-layer.md |
| Output limit escalation | Escalate to 64k tokens before compaction | resilience-layer.md |
| Streaming cleanup | Tombstone partial messages on fallback | resilience-layer.md |
| Error-as-observation | `try/except` → `ToolMessage` | resilience-layer.md |
| Circuit breaker | State machine wrapping tool fn | resilience-layer.md |
| Session resume | Checkpointer + stable `thread_id` | persistence-layer.md |
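
As an illustration of two rows in the table above (circuit breaker and error-as-observation), here is a stdlib-only sketch of a state machine wrapping a tool function. The class name, thresholds, error strings, and the injectable `clock` are assumptions made for this example. The version in resilience-layer.md may differ, and a LangGraph tool would return a `ToolMessage` where this returns a plain string.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """closed -> open after `max_failures` consecutive errors;
    open -> half-open once `cooldown` seconds pass; one success closes it."""

    def __init__(
        self,
        fn: Callable[..., Any],
        max_failures: int = 3,
        cooldown: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fn = fn
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def __call__(self, *args: Any, **kwargs: Any) -> str:
        if self.state == "open":
            # Fail fast without touching the underlying tool.
            return "ERROR: circuit open, tool temporarily disabled"
        try:
            result = self.fn(*args, **kwargs)
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                # Trip the breaker, or re-trip it after a failed half-open probe.
                self.opened_at = self.clock()
            # Error-as-observation: the model sees text instead of a crash.
            return f"ERROR: {type(exc).__name__}: {exc}"
        self.failures = 0
        self.opened_at = None
        return str(result)


def flaky(x: int) -> int:
    if x < 0:
        raise ValueError("negative input")
    return x * 2


now = [0.0]  # fake clock so the example is deterministic
tool = CircuitBreaker(flaky, max_failures=2, cooldown=10.0, clock=lambda: now[0])
first = tool(-1)     # error returned as text, breaker still closed
second = tool(-1)    # second consecutive failure trips the breaker
rejected = tool(5)   # refused without calling flaky
now[0] = 10.0        # cooldown elapsed: the next call is a half-open probe
probe = tool(5)      # probe succeeds, breaker closes again
```

Returning the error as a string keeps the loop alive so the model can react to it, while the breaker stops a failing dependency from being called on every turn.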
Gotchas
- Safety rules must be code, not prompts. A prompt saying "don't delete production data" is not a safety control.
- Never dump the full parent message history into a subagent. Pass only the specific task and relevant data; context pollution degrades performance and wastes tokens.
- `InMemorySaver` is for development only. Use Redis or Postgres checkpointers in production.
- `interrupt()` pauses the graph. Resume it by calling `graph.invoke(Command(resume=...), config=config)`; forgetting this leaves the agent stuck.
- Tool result truncation is mandatory. Large tool outputs (file reads, search results) will exhaust the context window if not truncated before returning.
- Always set `max_iterations`. Without a loop guard, a miscalibrated agent runs indefinitely and incurs unbounded cost.
- Apply compaction in tiers. Budget tool results → snip → microcompact → autocompact. Jumping straight to full summarization wastes tokens when a cheaper step would suffice.
- Track diminishing returns, not just token budget. An agent can burn through its iteration budget producing nearly empty continuations. Stop when the last 2 deltas are both below ~500 tokens.
- Snapshot config at query entry. Never re-read feature flags or env vars mid-turn: a remote config change during a 30-second response causes inconsistent behavior within a single turn.
- Concurrency safety must be checked at runtime. Schema metadata cannot determine if a bash command is safe, so inspect the actual input string at call time. Fail conservatively (serial) if parsing fails.
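
The loop-guard and diminishing-returns items above can be combined in one small driver. This is a sketch under stated assumptions: `run_loop`, its parameters, and the stop-reason strings are hypothetical, and `step` stands in for one agent turn that reports how many new tokens it produced. The 500-token threshold and the window of 2 come from the text; the default of 25 iterations is arbitrary.

```python
from typing import Callable, Optional


def run_loop(
    step: Callable[[int], Optional[int]],
    max_iterations: int = 25,
    min_delta: int = 500,
    window: int = 2,
) -> tuple[str, int]:
    """Call `step` until it finishes or a guard fires.

    `step(i)` returns the number of new tokens produced on turn i,
    or None when the agent is done.
    Returns (stop_reason, number_of_turns_that_produced_output).
    """
    deltas: list[int] = []
    for i in range(max_iterations):
        produced = step(i)
        if produced is None:
            return "done", i
        deltas.append(produced)
        recent = deltas[-window:]
        # Stop once the last `window` turns each added fewer than `min_delta` tokens.
        if len(recent) == window and all(d < min_delta for d in recent):
            return "diminishing_returns", i + 1
    # Hard ceiling: the loop guard that bounds cost.
    return "max_iterations", max_iterations


outputs = [1800, 1200, 300, 120, 90]
stalled = run_loop(lambda i: outputs[i])
runaway = run_loop(lambda i: 2000, max_iterations=5)
finished = run_loop(lambda i: None if i == 2 else 900)
```

In the first call the guard fires after the fourth turn, so the fifth, nearly empty continuation is never requested.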