Message Queues (Deep Workflow)
Queues decouple producers and consumers—and introduce duplicates, ordering surprises, poison messages, and lag. Make delivery semantics and failure handling explicit.
When to Offer This Workflow
Trigger conditions:
- - Choosing Kafka vs RabbitMQ vs SQS vs Pub/Sub
- Consumer lag, DLQ growth, reprocessing needs
- “Exactly-once” expectations—need alignment on reality
Initial offer:
Use six stages: (1) delivery semantics, (2) topology & partitions, (3) message contract, (4) consumers & retries, (5) ops & scaling, (6) failure drills). Confirm cloud and ordering requirements.
Stage 1: Delivery Semantics
Goal: Choose at-most-once, at-least-once, or effective-once via idempotency.
Questions
- 1. Can duplicate processing break invariants?
- Is ordering global, per-entity, or unnecessary?
- Latency vs durability trade-offs
Exit condition: One paragraph per pipeline stating semantics.
Stage 2: Topology & Partitions
Goal: Throughput and ordering align—ordering only within partition when using Kafka-style partitions.
Practices
- - Partition key often equals business key (e.g., user id)
- Watch hot partitions
Stage 3: Message Contract
Goal: Versioned events or commands with schema registry.
Practices
- - Envelope: id, type, version, timestamp
- Payload size limits; reference blobs by id
Stage 4: Consumers & Retries
Goal: Exponential backoff + jitter; DLQ with reason; replay tooling owned.
Pitfalls
- - Retries can reorder unless single-threaded per partition
Stage 5: Ops & Scaling
Goal: Lag metrics, consumer offset health, rebalance awareness (Kafka).
Stage 6: Failure Drills
Goal: Kill consumer mid-batch; duplicate publish intentionally; validate idempotency.
Final Review Checklist
- - [ ] Delivery semantics and idempotency explicit
- [ ] Partitioning/ordering strategy documented
- [ ] Versioned message contract
- [ ] Retry, DLQ, replay documented
- [ ] Lag metrics and alerts; capacity plan
Tips for Effective Guidance
- - Exactly-once end-to-end is rare—design for at-least-once + idempotent handlers.
- Challenge global ordering requirements—they cost scale.
- Visibility timeouts (SQS) differ by product—read the vendor docs.
Handling Deviations
- - Transactional outbox when you need DB + queue consistency without dual writes.
消息队列(深度工作流)
队列解耦了生产者和消费者——同时也引入了重复消息、顺序异常、毒消息和延迟。请明确投递语义和故障处理机制。
何时提供此工作流
触发条件:
- - 在Kafka、RabbitMQ、SQS、Pub/Sub之间进行选择
- 消费者延迟、死信队列增长、需要重新处理
- “精确一次”期望——需对齐实际情况
初始建议:
采用六个阶段:(1)投递语义,(2)拓扑结构与分区,(3)消息契约,(4)消费者与重试,(5)运维与扩展,(6)故障演练。确认云环境和排序需求。
阶段一:投递语义
目标: 选择至多一次、至少一次,或通过幂等性实现有效一次。
问题
- 1. 重复处理是否会破坏不变量?
- 排序是全局、按实体还是不需要?
- 延迟与持久性的权衡
退出条件: 每个管道用一段话说明其语义。
阶段二:拓扑结构与分区
目标: 吞吐量与排序对齐——使用Kafka风格分区时,仅在分区内保证顺序。
实践
- - 分区键通常等于业务键(例如用户ID)
- 关注热点分区
阶段三:消息契约
目标: 使用模式注册表的版本化事件或命令。
实践
- - 信封结构:ID、类型、版本、时间戳
- 负载大小限制;通过ID引用二进制大对象
阶段四:消费者与重试
目标: 指数退避+抖动;死信队列附带原因;拥有重放工具。
陷阱
阶段五:运维与扩展
目标: 延迟指标、消费者偏移健康度、再平衡感知(Kafka)。
阶段六:故障演练
目标: 在批次处理中杀死消费者;故意重复发布;验证幂等性。
最终审查清单
- - [ ] 投递语义和幂等性明确
- [ ] 分区/排序策略已记录
- [ ] 版本化消息契约
- [ ] 重试、死信队列、重放机制已记录
- [ ] 延迟指标和告警;容量规划
有效指导技巧
- - 端到端精确一次很少见——设计为至少一次+幂等处理器。
- 质疑全局排序需求——它们会牺牲扩展性。
- 可见性超时(SQS)因产品而异——请阅读供应商文档。
处理偏差
- - 当需要数据库与队列一致性且避免双写时,使用事务性发件箱模式。