LangGraph Architecture Decisions
When to Use LangGraph
Use LangGraph When You Need:
- Stateful conversations - Multi-turn interactions with memory
- Human-in-the-loop - Approval gates, corrections, interventions
- Complex control flow - Loops, branches, conditional routing
- Multi-agent coordination - Multiple LLMs working together
- Persistence - Resume from checkpoints, time travel debugging
- Streaming - Real-time token streaming, progress updates
- Reliability - Retries, error recovery, durability guarantees
Consider Alternatives When:
| Scenario | Alternative | Why |
|---|---|---|
| Single LLM call | Direct API call | Overhead not justified |
| Linear pipeline | LangChain LCEL | Simpler abstraction |
| Stateless tool use | Function calling | No persistence needed |
| Simple RAG | LangChain retrievers | Built-in patterns |
| Batch processing | Async tasks | Different execution model |
State Schema Decisions
TypedDict vs Pydantic
| TypedDict | Pydantic |
|---|---|
| Lightweight, faster | Runtime validation |
| Dict-like access | Attribute access |
| No validation overhead | Type coercion |
| Simpler serialization | Complex nested models |
Recommendation: Use TypedDict for most cases. Use Pydantic when you need validation or complex nested structures.
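To make the trade-off concrete, the sketch below uses only the standard library to show that a TypedDict performs no checks at runtime. The `validate` helper is a hypothetical stand-in for the kind of checking Pydantic does automatically; it is not part of LangGraph or Pydantic, and the field names are illustrative.

```python
from typing import TypedDict, get_type_hints


class State(TypedDict):
    query: str
    attempts: int


def validate(state: dict, schema: type) -> list[str]:
    """Hypothetical helper: report fields whose value does not match the annotation."""
    errors = []
    for name, expected in get_type_hints(schema).items():
        if name not in state:
            errors.append(f"{name}: missing")
        elif not isinstance(state[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


# A TypedDict is a plain dict at runtime, so a wrongly typed value is accepted silently.
# A static type checker would flag this line; Python itself does not.
bad_state: State = {"query": "hello", "attempts": "three"}
problems = validate(bad_state, State)
```

If mistakes like the one in `bad_state` must be caught when state is created, that is the case where the validation cost of Pydantic is likely worth paying.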
Reducer Selection
| Use Case | Reducer | Example or note |
|---|---|---|
| Chat messages | add_messages | Handles IDs, RemoveMessage |
| Simple append | operator.add | Annotated[list, operator.add] |
| Keep latest | None (LastValue) | field: str |
| Custom merge | Lambda | Annotated[list, lambda a, b: ...] |
| Overwrite list | Overwrite | Bypass reducer |
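The difference between a reducer field and a plain field can be illustrated without LangGraph. The sketch below mimics the merge step in plain Python: `apply_update` is a hypothetical helper written for this example, not LangGraph's implementation, and the field names are made up.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class State(TypedDict):
    log: Annotated[list, operator.add]  # reducer: combine the old value with the update
    status: str                         # no reducer: the latest write wins


def apply_update(state: dict, update: dict, schema: type) -> dict:
    """Illustrative merge: use a field's reducer if it has one, else overwrite."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata else None
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"log": ["start"], "status": "running"}
state = apply_update(state, {"log": ["step 1"], "status": "done"}, State)
```

After the update, `log` holds both entries while `status` holds only the new value, which is the behavior the "Simple append" and "Keep latest" rows describe.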
State Size Considerations
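```python
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages
from langgraph.store.base import BaseStore


# Small state (< 1 MB): keep it in state
class State(TypedDict):
    messages: Annotated[list, add_messages]
    context: str


# Large data: keep it in a store and hold only a reference in state
class State(TypedDict):
    messages: Annotated[list, add_messages]
    document_ref: str  # reference into the store


def node(state, *, store: BaseStore):
    # `namespace` is a placeholder for your own namespace tuple
    doc = store.get(namespace, state["document_ref"])
    # Process the document without bloating checkpoints
```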
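A rough way to see why references help is to compare how much data a checkpoint would need to serialize in each design. The sketch below is plain Python under stated assumptions: the module-level `store` dict and the `put_document` helper are hypothetical stand-ins for a real store, and JSON length is only an approximation of checkpoint size.

```python
import json

# Hypothetical in-process stand-in for an external store, keyed by (namespace, key).
store: dict[tuple[str, str], str] = {}


def put_document(namespace: str, key: str, text: str) -> str:
    """Save the document outside the state and return the key to keep in state."""
    store[(namespace, key)] = text
    return key


def summarize(state: dict) -> dict:
    """Node-style function: resolve the reference only when the document is needed."""
    doc = store[("docs", state["document_ref"])]
    return {"summary": doc[:20]}


large_document = "lorem ipsum " * 10_000
inline_state = {"messages": [], "context": large_document}
ref_state = {"messages": [], "document_ref": put_document("docs", "doc-1", large_document)}

# Approximate what a checkpoint would have to serialize in each case.
inline_size = len(json.dumps(inline_state))
ref_size = len(json.dumps(ref_state))
result = summarize(ref_state)
```

The inline state is serialized in full at every step, while the reference-based state stays a few dozen bytes regardless of document size.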
Graph Structure Decisions
Single Graph vs Subgraphs
Single Graph when:
- All nodes share the same state schema
- Simple linear or branching flow
- Fewer than 10 nodes
Subgraphs when:
- Different state schemas needed
- Reusable components across graphs
- Team separation of concerns
- Complex hierarchical workflows
Conditional Edges vs Command
| Conditional Edges | Command |
|---|---|
| Routing based on state | Routing + state update |
| Separate router function | Decision in node |
| Clearer visualization | More flexible |
| Standard patterns | Dynamic destinations |
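```python
from typing import Literal

from langgraph.types import Command


# Conditional edges: use when routing is the main concern
def router(state) -> Literal["a", "b"]:
    # `condition` is a placeholder for your own check on state
    return "a" if condition else "b"

builder.add_conditional_edges("node", router)


# Command: use when a node both routes and updates state
def node(state) -> Command:
    return Command(goto="next", update={"step": state["step"] + 1})
```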
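The two styles can be compared side by side in plain Python. In this sketch a dict with `goto` and `update` keys stands in for a `Command` so the example runs without LangGraph; the node names, the `score` field, and the 0.8 threshold are all hypothetical.

```python
from typing import Literal


def route_after_review(state: dict) -> Literal["revise", "publish"]:
    """Router style: inspect state and return the next node name. State is not modified."""
    return "publish" if state["approved"] else "revise"


def review_node(state: dict) -> dict:
    """Command style: one return value carries both the destination and the state update."""
    approved = state["score"] >= 0.8
    return {
        "goto": "publish" if approved else "revise",
        "update": {"approved": approved, "reviews": state["reviews"] + 1},
    }


decision = review_node({"score": 0.9, "reviews": 0})
```

In a real graph, `route_after_review` would be passed to `add_conditional_edges`, and `review_node` would return `Command(goto=..., update=...)` instead of a dict. The router keeps the branching visible in the graph definition; the Command style keeps the decision next to the logic that produces it.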
Static vs Dynamic Routing
Static Edges (add_edge):
- Fixed flow known at build time
- Clearer graph visualization
- Easier to reason about
Dynamic Routing (add_conditional_edges, Command, Send):
- Runtime decisions based on state
- Agent-driven navigation
- Fan-out patterns
Persistence Strategy
Checkpointer Selection
| Checkpointer | Use Case | Characteristics |
|---|---|---|
| InMemorySaver | Testing only | Lost on restart |
| SqliteSaver | Development | Single file, local |
| PostgresSaver | Production | Scalable, concurrent |
| Custom | Special needs | Implement BaseCheckpointSaver |
Checkpointing Scope
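```python
# Full persistence (default)
graph = builder.compile(checkpointer=checkpointer)

# Subgraph options (pass exactly one)
subgraph = sub_builder.compile(
    checkpointer=None,     # inherit from the parent graph
    # checkpointer=True,   # independent checkpointing
    # checkpointer=False,  # no checkpointing (runs atomically)
)
```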
When to Disable Checkpointing
- Short-lived subgraphs that should be atomic
- Subgraphs with incompatible state schemas
- Performance-critical paths without need for resume
Multi-Agent Architecture
Supervisor Pattern
Best for:
- Clear hierarchy
- Centralized decision making
- Different agent specializations
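```
             ┌───────────┐
             │Supervisor │
             └─────┬─────┘
    ┌─────────┬────┴────┬─────────┐
    ▼         ▼         ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│Agent 1│ │Agent 2│ │Agent 3│ │Agent 4│
└───────┘ └───────┘ └───────┘ └───────┘
```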
Peer-to-Peer Pattern
Best for:
- Collaborative agents
- No clear hierarchy
- Flexible communication
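```
┌───────┐     ┌───────┐
│Agent 1│◄───►│Agent 2│
└───┬───┘     └───┬───┘
    │             │
    ▼             ▼
┌───────┐     ┌───────┐
│Agent 3│◄───►│Agent 4│
└───────┘     └───────┘
```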
Handoff Pattern
Best for:
- Sequential specialization
- Clear stage transitions
- Different capabilities per stage
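```
┌──────────┐    ┌──────────┐    ┌──────────┐
│ Research │───►│ Planning │───►│ Execute  │
└──────────┘    └──────────┘    └──────────┘
```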
Streaming Strategy
Stream Mode Selection
| Mode | Use Case | Data |
|---|---|---|
| updates | UI updates | Node outputs only |
| values | State inspection | Full state each step |
| messages | Chat UX | LLM tokens |
| custom | Progress/logs | Your data via StreamWriter |
| debug | Debugging | Tasks + checkpoints |
Subgraph Streaming
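```python
# Stream events from subgraphs as well as the parent graph
async for chunk in graph.astream(
    input,
    stream_mode="updates",
    subgraphs=True,  # include subgraph events
):
    namespace, data = chunk  # namespace indicates nesting depth
```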
Human-in-the-Loop Design
Interrupt Placement
| Strategy | Use Case |
|---|---|
| interrupt_before | Approval before action |
| interrupt_after | Review after completion |
| interrupt() in node | Dynamic, contextual pauses |
Resume Patterns
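```python
from langgraph.types import Command

# Simple resume (same thread)
graph.invoke(None, config)

# Resume with a value
graph.invoke(Command(resume="approved"), config)

# Resume a specific interrupt
graph.invoke(Command(resume={interrupt_id: value}), config)

# Modify state, then resume
graph.update_state(config, {"field": "new_value"})
graph.invoke(None, config)
```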
Error Handling Strategy
Retry Configuration
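```python
from langgraph.types import RetryPolicy

# Per-node retry
# APIError and RateLimitError stand for your provider SDK's exception classes
RetryPolicy(
    initial_interval=0.5,
    backoff_factor=2.0,
    max_interval=60.0,
    max_attempts=3,
    retry_on=lambda e: isinstance(e, (APIError, TimeoutError)),
)

# Multiple policies (first match wins)
builder.add_node("node", fn, retry_policy=[
    RetryPolicy(retry_on=RateLimitError, max_attempts=5),
    RetryPolicy(retry_on=Exception, max_attempts=2),
])
```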
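It can help to work out what a retry configuration means in waiting time before committing to it. The sketch below computes the delays for an initial interval of 0.5 s, a backoff factor of 2.0, and a cap of 60 s. It assumes the usual exponential-backoff reading of these parameters, counts the first call as one attempt, and ignores jitter; `backoff_schedule` is a hypothetical helper, not a LangGraph function.

```python
def backoff_schedule(initial_interval: float, backoff_factor: float,
                     max_interval: float, max_attempts: int) -> list[float]:
    """Delays between attempts: exponential growth capped at max_interval, jitter ignored."""
    delays = []
    interval = initial_interval
    for _ in range(max_attempts - 1):  # no wait after the final attempt
        delays.append(min(interval, max_interval))
        interval *= backoff_factor
    return delays


delays = backoff_schedule(0.5, 2.0, 60.0, 3)
long_run = backoff_schedule(0.5, 2.0, 60.0, 10)
```

With three attempts the node waits 0.5 s and then 1.0 s, so a persistent failure surfaces after roughly 1.5 s of backoff. With ten attempts the later delays hit the 60 s cap, which may be too slow for an interactive request.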
Fallback Patterns
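```python
def node_with_fallback(state):
    try:
        return primary_operation(state)
    except PrimaryError:
        return fallback_operation(state)

# Or use conditional edges for complex fallback routing
```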
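One way to route on errors, rather than handle them inside the node, is to record the failure in state and let a router pick the next node. The sketch below is plain Python and makes several assumptions: `call_primary`, the `error` field, and the node names `fallback` and `finish` are all hypothetical.

```python
def call_primary(state: dict) -> dict:
    """Hypothetical primary operation that fails when the input is empty."""
    if not state["query"]:
        raise ValueError("empty query")
    return {"answer": state["query"].upper(), "error": None}


def guarded_node(state: dict) -> dict:
    """Record the failure in state instead of raising, so an edge can route on it."""
    try:
        return call_primary(state)
    except ValueError as exc:
        return {"answer": None, "error": str(exc)}


def route_after_guarded(state: dict) -> str:
    """Router for a conditional edge: pick the next node name from the error field."""
    return "fallback" if state.get("error") else "finish"


ok = guarded_node({"query": "hi"})
failed = guarded_node({"query": ""})
```

Compared with a try/except inside a single node, this keeps the fallback as its own node, which makes it visible in the graph and lets it carry its own retry policy.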
Scaling Considerations
Horizontal Scaling
- Use PostgresSaver for shared state
- Consider LangGraph Platform for managed infrastructure
- Use stores for large data outside checkpoints
Performance Optimization
1. Minimize state size - Use references for large data
2. Parallel nodes - Fan out when possible
3. Cache expensive operations - Use CachePolicy
4. Async everywhere - Use ainvoke, astream
Resource Limits
CODEBLOCK10
Decision Checklist
Before implementing:
1. [ ] Is LangGraph the right tool? (vs simpler alternatives)
2. [ ] State schema defined with appropriate reducers?
3. [ ] Persistence strategy chosen? (dev vs prod checkpointer)
4. [ ] Streaming needs identified?
5. [ ] Human-in-the-loop points defined?
6. [ ] Error handling and retry strategy?
7. [ ] Multi-agent coordination pattern? (if applicable)
8. [ ] Resource limits configured?
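LangGraph Architecture Decisions
When to Use LangGraph
Use LangGraph When You Need:
- Stateful conversations - Multi-turn interactions with memory
- Human-in-the-loop - Approval gates, corrections, interventions
- Complex control flow - Loops, branches, conditional routing
- Multi-agent coordination - Multiple LLMs working together
- Persistence - Resume from checkpoints, time travel debugging
- Streaming - Real-time token streaming, progress updates
- Reliability - Retries, error recovery, durability guarantees
Consider Alternatives When:
| Scenario | Alternative | Why |
|---|---|---|
| Single LLM call | Direct API call | Overhead not justified |
| Linear pipeline | LangChain LCEL | Simpler abstraction |
| Stateless tool use | Function calling | No persistence needed |
| Simple RAG | LangChain retrievers | Built-in patterns |
| Batch processing | Async tasks | Different execution model |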
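State Schema Decisions
TypedDict vs Pydantic
| TypedDict | Pydantic |
|---|---|
| Lightweight, faster | Runtime validation |
| Dict-like access | Attribute access |
| No validation overhead | Type coercion |
| Simpler serialization | Complex nested models |
Recommendation: Use TypedDict for most cases. Use Pydantic when you need validation or complex nested structures.
Reducer Selection
| Use Case | Reducer | Example or note |
|---|---|---|
| Chat messages | add_messages | Handles IDs, RemoveMessage |
| Simple append | operator.add | Annotated[list, operator.add] |
| Keep latest | None (LastValue) | field: str |
| Custom merge | Lambda | Annotated[list, lambda a, b: ...] |
| Overwrite list | Overwrite | Bypass reducer |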
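State Size Considerations
```python
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages
from langgraph.store.base import BaseStore


# Small state (< 1 MB): keep it in state
class State(TypedDict):
    messages: Annotated[list, add_messages]
    context: str


# Large data: keep it in a store and hold only a reference in state
class State(TypedDict):
    messages: Annotated[list, add_messages]
    document_ref: str  # reference into the store


def node(state, *, store: BaseStore):
    # `namespace` is a placeholder for your own namespace tuple
    doc = store.get(namespace, state["document_ref"])
    # Process the document without bloating checkpoints
```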
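Graph Structure Decisions
Single Graph vs Subgraphs
Single Graph when:
- All nodes share the same state schema
- Simple linear or branching flow
- Fewer than 10 nodes
Subgraphs when:
- Different state schemas needed
- Reusable components across graphs
- Team separation of concerns
- Complex hierarchical workflows
Conditional Edges vs Command
| Conditional Edges | Command |
|---|---|
| Routing based on state | Routing + state update |
| Separate router function | Decision in node |
| Clearer visualization | More flexible |
| Standard patterns | Dynamic destinations |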
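```python
from typing import Literal

from langgraph.types import Command


# Conditional edges: use when routing is the main concern
def router(state) -> Literal["a", "b"]:
    # `condition` is a placeholder for your own check on state
    return "a" if condition else "b"

builder.add_conditional_edges("node", router)


# Command: use when a node both routes and updates state
def node(state) -> Command:
    return Command(goto="next", update={"step": state["step"] + 1})
```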
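Static vs Dynamic Routing
Static Edges (add_edge):
- Fixed flow known at build time
- Clearer graph visualization
- Easier to reason about
Dynamic Routing (add_conditional_edges, Command, Send):
- Runtime decisions based on state
- Agent-driven navigation
- Fan-out patterns
Persistence Strategy
Checkpointer Selection
| Checkpointer | Use Case | Characteristics |
|---|---|---|
| InMemorySaver | Testing only | Lost on restart |
| SqliteSaver | Development | Single file, local |
| PostgresSaver | Production | Scalable, concurrent |
| Custom | Special needs | Implement BaseCheckpointSaver |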
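Checkpointing Scope
```python
# Full persistence (default)
graph = builder.compile(checkpointer=checkpointer)

# Subgraph options (pass exactly one)
subgraph = sub_builder.compile(
    checkpointer=None,     # inherit from the parent graph
    # checkpointer=True,   # independent checkpointing
    # checkpointer=False,  # no checkpointing (runs atomically)
)
```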
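When to Disable Checkpointing
- Short-lived subgraphs that should be atomic
- Subgraphs with incompatible state schemas
- Performance-critical paths without need for resume
Multi-Agent Architecture
Supervisor Pattern
Best for:
- Clear hierarchy
- Centralized decision making
- Different agent specializations
```
             ┌───────────┐
             │Supervisor │
             └─────┬─────┘
    ┌─────────┬────┴────┬─────────┐
    ▼         ▼         ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│Agent 1│ │Agent 2│ │Agent 3│ │Agent 4│
└───────┘ └───────┘ └───────┘ └───────┘
```
Peer-to-Peer Pattern
Best for:
- Collaborative agents
- No clear hierarchy
- Flexible communication
```
┌───────┐     ┌───────┐
│Agent 1│◄───►│Agent 2│
└───┬───┘     └───┬───┘
    │             │
    ▼             ▼
┌───────┐     ┌───────┐
│Agent 3│◄───►│Agent 4│
└───────┘     └───────┘
```
Handoff Pattern
Best for:
- Sequential specialization
- Clear stage transitions
- Different capabilities per stage
```
┌──────────┐    ┌──────────┐    ┌──────────┐
│ Research │───►│ Planning │───►│ Execute  │
└──────────┘    └──────────┘    └──────────┘
```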
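Streaming Strategy
Stream Mode Selection
| Mode | Use Case | Data |
|---|---|---|
| updates | UI updates | Node outputs only |
| values | State inspection | Full state each step |
| messages | Chat UX | LLM tokens |
| custom | Progress/logs | Your data via StreamWriter |
| debug | Debugging | Tasks + checkpoints |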
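Subgraph Streaming
```python
# Stream events from subgraphs as well as the parent graph
async for chunk in graph.astream(
    input,
    stream_mode="updates",
    subgraphs=True,  # include subgraph events
):
    namespace, data = chunk  # namespace indicates nesting depth
```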
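Human-in-the-Loop Design
Interrupt Placement
| Strategy | Use Case |
|---|---|
| interrupt_before | Approval before action |
| interrupt_after | Review after completion |
| interrupt() in node | Dynamic, contextual pauses |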
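Resume Patterns
```python
from langgraph.types import Command

# Simple resume (same thread)
graph.invoke(None, config)

# Resume with a value
graph.invoke(Command(resume="approved"), config)

# Resume a specific interrupt
graph.invoke(Command(resume={interrupt_id: value}), config)

# Modify state, then resume
graph.update_state(config, {"field": "new_value"})
graph.invoke(None, config)
```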
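Error Handling Strategy
Retry Configuration
```python
from langgraph.types import RetryPolicy

# Per-node retry
# APIError and RateLimitError stand for your provider SDK's exception classes
RetryPolicy(
    initial_interval=0.5,
    backoff_factor=2.0,
    max_interval=60.0,
    max_attempts=3,
    retry_on=lambda e: isinstance(e, (APIError, TimeoutError)),
)

# Multiple policies (first match wins)
builder.add_node("node", fn, retry_policy=[
    RetryPolicy(retry_on=RateLimitError, max_attempts=5),
    RetryPolicy(retry_on=Exception, max_attempts=2),
])
```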
Fallback Patterns
```python
def node_with_fallback(state):
    try:
        return primary_operation(state)
    except PrimaryError:
        return fallback_operation(state)

# Or use conditional edges for complex fallback routing
def route_on_error
```