Serverless (Deep Workflow)
Serverless shifts complexity to permissions, quotas, observability, and state at the edges. Guide the user to explicit trade-offs: simplicity vs cold starts, synchronous vs async, and least-privilege IAM that is still operable.
When to Offer This Workflow
Trigger conditions:
- Choosing between containers vs functions, or decomposing a service into functions
- Cold starts, timeouts, memory sizing, or concurrency throttling
- “Works locally, fails in Lambda”—IAM, VPC, DNS, or env differences
- Cost spikes, recursive invocation, or DLQ backlogs
Initial offer:
Use six stages: (1) workload fit & constraints, (2) triggers & contract, (3) IAM & networking, (4) runtime performance, (5) observability & ops, (6) cost & governance. Confirm cloud and language/runtime.
Stage 1: Workload Fit & Constraints
Goal: Decide if functions are appropriate and what boundaries look like.
Fit Criteria (heuristics)
- Good: event-driven, spiky traffic, small well-defined units, short execution, state externalized
- Hard: long CPU-heavy jobs, large in-memory state, strict low-latency p99 without provisioned concurrency, complex socket protocols
Clarify
- SLAs: sync API vs async pipeline
- Payload limits, execution time cap, tmp storage
- Stateful needs: DB, queue, cache, workflow engine
Exit condition: Clear yes/no/partial with escape hatch (container, batch, ECS/Fargate, Step Functions).
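The fit heuristics above can be turned into a small pre-check. This is a minimal sketch, not a definitive rubric: the `Workload` and `PlatformLimits` types, their field names, and the yes/partial/no rules are all assumptions made for illustration, and the limit values must come from your platform's current documentation.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of the workload under review."""
    max_duration_s: float
    max_payload_bytes: int
    large_in_memory_state: bool = False
    strict_p99: bool = False
    provisioned_concurrency: bool = False


@dataclass
class PlatformLimits:
    """Limits supplied by the caller; no real platform values are assumed."""
    timeout_s: float
    payload_bytes: int


def assess_fit(workload: Workload, limits: PlatformLimits):
    """Return (verdict, reasons) where verdict is 'yes', 'partial', or 'no'."""
    blockers, concerns = [], []
    if workload.max_duration_s > limits.timeout_s:
        blockers.append("execution time exceeds platform timeout")
    if workload.max_payload_bytes > limits.payload_bytes:
        blockers.append("payload exceeds platform limit")
    if workload.large_in_memory_state:
        concerns.append("large in-memory state; externalize it")
    if workload.strict_p99 and not workload.provisioned_concurrency:
        concerns.append("strict p99 without provisioned concurrency")
    if blockers:
        return "no", blockers + concerns
    if concerns:
        return "partial", concerns
    return "yes", []
```

A "no" or "partial" verdict is the cue to discuss the escape hatch (container, batch, workflow engine) rather than force the workload into a function.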
Stage 2: Triggers & Contract
Goal: Define inputs, idempotency, retry semantics, and output side effects.
Map
- Triggers: HTTP, queue, schedule, object storage, streams, webhooks
- At-least-once delivery → idempotent handlers and dedupe keys
- Partial failure in batch: what gets retried vs poison messages
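To make the at-least-once point concrete, here is a minimal sketch of an idempotent handler that uses a dedupe key. `InMemoryDedupeStore` is a hypothetical stand-in; in practice the store would need to be durable and support an atomic conditional write, and the `id` field on the event is an assumption about the event schema.

```python
import hashlib
import json


class InMemoryDedupeStore:
    """Stand-in for a durable store with conditional writes (hypothetical)."""

    def __init__(self):
        self._seen = set()

    def claim(self, key: str) -> bool:
        """Return True only the first time a key is claimed."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def dedupe_key(event: dict) -> str:
    """Prefer an explicit event ID; fall back to a hash of the payload."""
    if "id" in event:
        return str(event["id"])
    body = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def handle(event: dict, store: InMemoryDedupeStore, side_effects: list) -> str:
    """Apply the side effect at most once per dedupe key."""
    key = dedupe_key(event)
    if not store.claim(key):
        return "duplicate"
    side_effects.append(event)  # placeholder for the real side effect
    return "processed"
```

Note the ordering trade-off: this sketch claims the key before the side effect, so a crash between the two steps would drop the event unless the claim is released or given an expiry. Claiming after the side effect has the opposite failure mode (a possible double apply).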
Design
- Event schema versioning; backward-compatible consumers
- DLQ or dead-letter path with replay procedure
Exit condition: Written contract: success criteria, retry policy, dead-letter ownership.
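For the batch partial-failure question, one option on AWS Lambda with an SQS trigger is to return only the failed message IDs so the rest of the batch is not retried. The sketch below assumes the event source mapping has `ReportBatchItemFailures` enabled; `process_record` and the `order_id` field are hypothetical business logic.

```python
import json


def process_record(body: dict) -> None:
    """Hypothetical business logic; raises on input it cannot handle."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event: dict, context=None) -> dict:
    """Process each record and report only the failures for retry."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; successes are deleted from the queue.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Messages that keep failing are poison messages; the queue's redrive policy (max receive count) is what eventually moves them to the DLQ, which is why dead-letter ownership belongs in the written contract.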
Stage 3: IAM & Networking
Goal: Least privilege that is debuggable; correct VPC when needed.
IAM
- One role per function family; resource-scoped policies
- Avoid “*” actions on “*” resources except where cloud forces it; then narrow ASAP
- Cross-account and KMS decrypt permissions explicit
Networking
- Public vs VPC-attached functions (cold start + ENI trade-offs)
- Egress for third-party APIs: NAT costs and security groups / NACLs
- Private API Gateway / internal ALB patterns if applicable
Exit condition: IAM policy review with least privilege checklist; network path diagram for dependencies.
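As one input to the least-privilege checklist, a policy document can be linted for wildcards before review. This is a rough sketch, not a substitute for a real policy analyzer: `find_wildcards` is a hypothetical helper, and the account ID, table name, and key ID in the example policy are placeholders.

```python
def find_wildcards(policy: dict) -> list:
    """Return (statement_index, field, value) for each wildcard in Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append((index, "Action", action))
        for resource in resources:
            if resource == "*":
                findings.append((index, "Resource", resource))
    return findings


# Example of a resource-scoped policy (all identifiers are placeholders).
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        },
        {
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
        },
    ],
}
```

Some actions only accept a wildcard resource, so a finding is a prompt for a documented exception, not an automatic failure.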
Stage 4: Runtime Performance
Goal: Meet latency and throughput within platform limits.
Tactics
- Memory tuning: CPU scales with memory on many clouds, so profile
- Provisioned concurrency / min instances for critical sync paths—cost trade-off
- Connection reuse (HTTP, DB): create clients in global scope, outside the handler, where safe
- Cold start: trim dependencies, ARM Graviton if supported, lazy init discipline
- Timeouts set below client expectations; avoid infinite hangs
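The connection-reuse and lazy-init tactics can be sketched as follows. `ExpensiveClient` is a hypothetical stand-in for a DB or HTTP client; the point is only that the object lives in module scope, so warm invocations in the same execution environment skip the setup cost.

```python
import time

_client = None      # survives across warm invocations of the same environment
_init_count = 0     # illustration only: counts how often setup actually ran


class ExpensiveClient:
    """Hypothetical client whose construction is slow (TLS, auth, pooling)."""

    def __init__(self):
        self.created_at = time.monotonic()

    def query(self, key: str) -> dict:
        return {"key": key}


def get_client() -> ExpensiveClient:
    """Create the client on first use, then reuse it."""
    global _client, _init_count
    if _client is None:
        _client = ExpensiveClient()
        _init_count += 1
    return _client


def init_count() -> int:
    return _init_count


def handler(event: dict, context=None) -> dict:
    client = get_client()
    return client.query(event["key"])
```

Lazy creation keeps the cost off code paths that never touch the dependency; the "where safe" caveat covers stale connections and credentials that expire while the environment stays warm.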
Concurrency
- Reserved concurrency vs account limits; avoid starving other functions
Exit condition: Load test or trace evidence for p95/p99; documented limits and mitigations.
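For the p95/p99 evidence, a small helper over measured durations is often enough. This sketch uses the nearest-rank percentile definition (other definitions interpolate and give slightly different numbers); the function names and the budget parameters are assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def check_latency(samples_ms, timeout_ms, p99_budget_ms):
    """Summarize tail latency against a budget and the configured timeout."""
    p99 = percentile(samples_ms, 99)
    return {
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": p99,
        "within_budget": p99 <= p99_budget_ms,
        "timeout_headroom_ms": timeout_ms - max(samples_ms),
    }
```

Small or negative timeout headroom is worth recording alongside the percentiles, since it shows how close the slowest observed invocation came to being killed.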
Stage 5: Observability & Operations
Goal: Debuggable serverless—correlation across async hops.
Practices
- Structured logging with request IDs; PII redaction
- Tracing (X-Ray, OpenTelemetry) across queue → function → DB
- Metrics: throttles, errors, duration, iterator age for streams
- Alarms on error rate, DLQ depth, duration approaching timeout
Runbooks
- Replay DLQ safely (idempotency!)
- Blue/green or canary if using traffic shifting features
Exit condition: Dashboard + alerts + on-call steps for top failure modes.
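Structured logging with a request ID and PII redaction might look like the sketch below. The sensitive key names and the email pattern are assumptions; a real deployment would take its redaction rules from its own data classification.

```python
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SENSITIVE_KEYS = {"password", "ssn", "card_number"}  # assumed list


def redact(value):
    """Recursively mask sensitive keys and email-like strings."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[EMAIL]", value)
    return value


def log_line(level: str, message: str, request_id: str, **fields) -> str:
    """Build one JSON log line; the caller decides where to write it."""
    record = {"level": level, "message": message, "request_id": request_id}
    record.update(redact(fields))
    return json.dumps(record, sort_keys=True)
```

Passing the same `request_id` through queue message attributes lets log lines from each async hop be joined later, which is the correlation the stage goal asks for.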
Stage 6: Cost & Governance
Goal: Predictable spend and guardrails.
Levers
- Right-size memory; eliminate unnecessary VPC; async where sync not needed
- Recursive patterns and accidental infinite loops—billing alerts
- Tagging for cost allocation; budgets and anomaly detection
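A back-of-the-envelope cost model helps when discussing these levers. The sketch assumes the common pricing shape of a per-request charge plus a charge per GB-second of compute; both prices are parameters, not real rates, and it ignores free tiers, provisioned concurrency, and data transfer.

```python
def monthly_cost(invocations, avg_duration_ms, memory_mb,
                 price_per_gb_second, price_per_million_requests):
    """Estimate monthly spend from invocation count, duration, and memory size."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute = gb_seconds * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return {
        "gb_seconds": gb_seconds,
        "compute": compute,
        "requests": requests,
        "total": compute + requests,
    }
```

Because compute scales linearly with invocations, a recursive trigger loop multiplies the bill directly, which is why billing alerts sit next to right-sizing in the list above.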
Governance
- Approved runtimes; dependency scanning; org-level deny policies for public buckets, etc.
Final Review Checklist
- [ ] Workload fit validated; boundaries documented
- [ ] Idempotency + DLQ + replay story clear
- [ ] IAM minimal; network path understood
- [ ] Latency/cold start addressed for critical paths
- [ ] Observability and alarms in place
- [ ] Cost and recursion risks acknowledged
Tips for Effective Guidance
- Always state at-least-once and what breaks if handlers are not idempotent.
- When user says “Lambda slow,” separate cold start vs downstream vs code.
- Prefer Step Functions / workflows when logic is long-running or branching, rather than nested Lambdas calling Lambdas ad hoc.
Handling Deviations
- “We only have one function”: still document IAM, retries, and logs; future you will thank you.
- Edge workers: emphasize CPU time limits, geography, and cache semantics.