WebSocket Patterns (Deep Workflow)
Realtime connections add stateful complexity: who is connected, what order messages arrive, and what happens when links flap. Design for at-least-once delivery, explicit heartbeats, and horizontal scaling early.
When to Offer This Workflow
Trigger conditions:
- Replacing polling with WS or SSE
- Auth on connect; token refresh mid-session
- Fan-out to many subscribers; presence and typing indicators
- Sticky sessions, load balancer timeouts, reconnect storms
Initial offer:
Use six stages: (1) choose transport, (2) connection & auth, (3) protocol & messages, (4) reliability & ordering, (5) scale & ops, (6) security & abuse. Confirm whether the clients are browsers or servers, and which proxies sit in the path (nginx, ALB, Cloudflare).
Stage 1: Choose Transport
Goal: WebSocket vs SSE vs long polling—right tool per direction.
Heuristics
- Bidirectional, low latency, binary payloads → WebSocket
- Server → client one-way streams, HTTP-friendly infra → SSE
- Fire-and-forget notifications with simple infra → consider push services first
Caveats
- Corporate proxies have historically interfered with WS; test in the target environments. WSS is mandatory
- HTTP/3 QUIC stacks differ—validate intermediaries
Exit condition: Transport choice documented with why not alternatives.
Stage 2: Connection & Auth
Goal: Authenticated sockets without long-lived secrets in query strings when avoidable.
Patterns
- JWT in Sec-WebSocket-Protocol or in the first message after connect; prefer short-lived tokens plus a refresh flow
- Cookie sessions with CSRF considerations on same-site policies
- Re-auth before token expiry; graceful close with code and reason
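The re-auth and graceful-close pattern above might look like the following minimal sketch. The 30-second margin and the close code 4401 are assumptions for illustration (RFC 6455 leaves the 4000-4999 range for application use); the function names are hypothetical.

```python
# Hypothetical application close code; RFC 6455 reserves 4000-4999 for private use.
CLOSE_TOKEN_EXPIRED = 4401

def refresh_delay(expires_at: float, now: float, margin: float = 30.0) -> float:
    """Seconds to wait before starting re-auth, leaving `margin` seconds of slack."""
    return max(0.0, expires_at - now - margin)

def check_session(expires_at: float, now: float):
    """Return None while the token is valid, else a (code, reason) pair for a graceful close."""
    if now < expires_at:
        return None
    return (CLOSE_TOKEN_EXPIRED, "token expired")
```

A client would schedule its refresh after `refresh_delay(...)` seconds; the server would run `check_session` on a timer and close the socket with the returned code and reason.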
Authorization
- Subscribe to topics only after a server-side check; never trust client-supplied channel names alone
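As a rough illustration of the server-side check, the sketch below looks up each requested channel in a membership table the server owns. The table contents, channel names, and function names are all hypothetical.

```python
# Hypothetical server-side membership table: channel name -> user ids allowed to read it.
MEMBERSHIPS = {
    "room:42": {"alice", "bob"},
    "admin:audit": {"alice"},
}

def can_subscribe(user_id: str, channel: str, memberships=MEMBERSHIPS) -> bool:
    """Allow only if the server's own records grant access; unknown channels are denied."""
    return user_id in memberships.get(channel, set())

def handle_subscribe(user_id: str, requested: list) -> dict:
    """Split a client's requested channels into accepted and rejected lists."""
    accepted = [c for c in requested if can_subscribe(user_id, c)]
    rejected = [c for c in requested if not can_subscribe(user_id, c)]
    return {"accepted": accepted, "rejected": rejected}
```

Denying unknown channels by default keeps a mistyped or forged channel name from silently creating a subscription.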
Exit condition: Auth diagram: issue token → connect → authorize subscriptions.
Stage 3: Protocol & Messages
Goal: Versioned message schema; predictable errors.
Design
- Envelope: { type, id, ts, payload }; correlation ids for RPC-style calls
- Version negotiation on connect or feature flags in hello message
- Binary vs JSON—protobuf/msgpack for bandwidth; JSON for debuggability early
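One possible JSON encoding of the envelope is sketched below. The `v` and `reply_to` field names, the millisecond timestamp, and the UUID message ids are assumptions, not part of the envelope described above.

```python
import json
import time
import uuid

PROTOCOL_VERSION = 1  # assumed version number, sent with every message
REQUIRED_FIELDS = {"type", "id", "ts", "payload"}

def make_envelope(msg_type: str, payload, reply_to=None) -> str:
    """Serialize a message; `reply_to` carries the id of the request being answered."""
    env = {
        "v": PROTOCOL_VERSION,
        "type": msg_type,
        "id": str(uuid.uuid4()),
        "ts": int(time.time() * 1000),  # milliseconds since the Unix epoch
        "payload": payload,
    }
    if reply_to is not None:
        env["reply_to"] = reply_to
    return json.dumps(env)

def parse_envelope(raw: str) -> dict:
    """Parse and validate an incoming message, raising ValueError on malformed input."""
    try:
        env = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("invalid JSON") from None
    if not isinstance(env, dict):
        raise ValueError("envelope must be a JSON object")
    missing = REQUIRED_FIELDS - env.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return env
```

Rejecting malformed envelopes with a single exception type makes it easier to map them to one predictable error message on the wire.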
Heartbeats
- Ping/pong or application-level heartbeat at an interval shorter than the proxy idle timeout (often 30–60s)
- Idle detection and clean disconnect
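The timing rules above can be sketched as two small helpers. The safety factor of 0.5 and the tolerance of two missed heartbeats are assumed defaults, not recommendations from any particular proxy vendor.

```python
def heartbeat_interval(proxy_idle_timeout_s: float, safety_factor: float = 0.5) -> float:
    """Pick a heartbeat interval comfortably below the proxy or LB idle timeout."""
    return proxy_idle_timeout_s * safety_factor

def is_idle(last_seen: float, now: float, interval: float, missed_allowed: int = 2) -> bool:
    """Treat the peer as gone once more than `missed_allowed` intervals pass in silence."""
    return now - last_seen > interval * missed_allowed
```

With an assumed 60-second idle timeout this gives a 30-second heartbeat, and a peer is closed after more than 60 seconds without any traffic.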
Exit condition: Protocol doc + example session transcript.
Stage 4: Reliability & Ordering
Goal: Define delivery semantics—usually at-least-once over TCP; ordering per channel.
Practices
- Idempotent message handlers; dedupe by message id when retries exist
- Per-user sequence numbers if strict order matters
- Buffer limits: drop, close, or apply backpressure policy
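A minimal sketch of two of these practices: a bounded dedupe set keyed by message id, and an outbound buffer with an explicit overflow policy. The class name, capacities, and policy names are hypothetical.

```python
from collections import OrderedDict, deque

class Deduper:
    """Remembers the most recent message ids so redelivered messages can be skipped."""

    def __init__(self, capacity: int = 1024):
        self._seen = OrderedDict()
        self._capacity = capacity

    def first_time(self, msg_id: str) -> bool:
        if msg_id in self._seen:
            self._seen.move_to_end(msg_id)  # keep recently repeated ids around longer
            return False
        self._seen[msg_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest id to bound memory
        return True

def enqueue(buffer: deque, msg, limit: int, policy: str = "drop_oldest") -> str:
    """Apply the overflow policy; "close" tells the caller to disconnect the slow client."""
    if len(buffer) < limit:
        buffer.append(msg)
        return "queued"
    if policy == "drop_oldest":
        buffer.popleft()
        buffer.append(msg)
        return "dropped_oldest"
    if policy == "close":
        return "close"
    raise ValueError(f"unknown policy: {policy}")
```

The dedupe window is finite, so a retry that arrives after its id has been evicted would be processed again; idempotent handlers cover that case.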
Reconnect
- Exponential backoff + jitter to prevent thundering herd
- Resume from last seen seq if missed messages are unacceptable—persist or snapshot
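The reconnect rules above can be sketched as follows, using the "full jitter" variant of exponential backoff and a replay-or-snapshot decision on resume. The base delay, cap, and message shape (`{"seq": n}`) are assumptions.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, rng=random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))

def resume_plan(log: list, last_seen_seq: int):
    """Decide how to catch a client up after reconnect.

    `log` holds retained messages as dicts with increasing "seq" values.
    If the oldest retained message is past the client's gap, replay cannot
    fill it and the client needs a fresh snapshot instead.
    """
    if log and log[0]["seq"] > last_seen_seq + 1:
        return ("snapshot", [])
    return ("replay", [m for m in log if m["seq"] > last_seen_seq])
```

Randomizing the whole delay, rather than adding a small jitter on top, spreads reconnect attempts more evenly after a mass disconnect.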
Exit condition: Reconnect story documented; storm mitigation tested.
Stage 5: Scale & Operations
Goal: Many connections across many nodes—affinity and pub/sub backbone.
Architecture
- Sticky sessions or shared pub/sub (Redis, NATS, Kafka) for cross-node fan-out
- Shard connection maps; avoid single giant in-memory map on one box
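To illustrate the shape of cross-node fan-out, the sketch below uses an in-process `Backbone` class as a stand-in for a real pub/sub system such as Redis, NATS, or Kafka. Every class and method name here is hypothetical; the point is only that each node keeps its own connection map and the backbone carries messages between nodes.

```python
from collections import defaultdict

class Node:
    """One server process; it only knows about connections attached to itself."""

    def __init__(self, name: str):
        self.name = name
        self.local = defaultdict(set)  # channel -> connection ids on this node
        self.delivered = []            # (conn_id, message) pairs, recorded for illustration

    def attach(self, conn_id: str, channel: str) -> None:
        self.local[channel].add(conn_id)

    def on_broadcast(self, channel: str, message: str) -> None:
        # Deliver only to this node's own subscribers of the channel.
        for conn_id in sorted(self.local.get(channel, ())):
            self.delivered.append((conn_id, message))

class Backbone:
    """In-memory stand-in for the shared pub/sub layer."""

    def __init__(self):
        self.nodes = []

    def register(self, node: Node) -> None:
        self.nodes.append(node)

    def publish(self, channel: str, message: str) -> None:
        for node in self.nodes:
            node.on_broadcast(channel, message)
```

Because no node holds the full connection map, adding a node only adds one more subscriber to the backbone.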
Observability
- Metrics: active connections, msg/sec, queue depth, disconnect reasons
- Tracing: connect → subscribe → first message latency
Load shedding
- Max connections per IP/user; rate limit connection attempts
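One common way to rate limit connection attempts is a token bucket per IP or user. The sketch below is one such bucket; the rate and burst values in the usage note are placeholders, and time is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Allows `burst` attempts at once, then refills at `rate` attempts per second."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A server might keep one bucket per client IP, for example `TokenBucket(rate=1, burst=5)`, and reject the handshake when `allow(time.monotonic())` returns False.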
Exit condition: Capacity model: connections per node × message fan-out cost.
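A back-of-the-envelope version of this capacity model could look like the sketch below. The function names and the example figures in the usage note are illustrative assumptions, not measurements.

```python
def nodes_needed(total_connections: int, per_node_limit: int) -> int:
    """Ceiling division: how many nodes are required to hold all connections."""
    return -(-total_connections // per_node_limit)

def outbound_msgs_per_sec(publish_rate: float, avg_subscribers: float) -> float:
    """Each published message is written once per subscriber, so fan-out multiplies load."""
    return publish_rate * avg_subscribers
```

For example, 100,000 connections at an assumed 30,000 per node needs 4 nodes, and 50 publishes per second to 2,000 subscribers each means 100,000 socket writes per second across the fleet.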
Stage 6: Security & Abuse
Goal: Minimize attack surface on long-lived pipes.
Controls
- WSS everywhere; validate Origin where applicable
- Payload size limits; compression bomb awareness
- AuthZ on every subscription; audit admin actions
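The first two controls might be sketched as below. The allowed origin, the 64 KiB limit, and the function names are assumptions. The bounded inflate uses the plain zlib stream format for illustration; a real permessage-deflate extension uses raw deflate framing handled by the WebSocket library.

```python
import zlib

ALLOWED_ORIGINS = {"https://app.example.com"}  # hypothetical allowlist
MAX_PAYLOAD_BYTES = 64 * 1024                  # assumed per-message limit

def origin_allowed(origin) -> bool:
    """Reject browser handshakes whose Origin header is missing or not allowlisted."""
    return origin is not None and origin in ALLOWED_ORIGINS

def payload_ok(data: bytes) -> bool:
    """Cheap size check before any parsing work is done."""
    return len(data) <= MAX_PAYLOAD_BYTES

def inflate_bounded(data: bytes, limit: int = MAX_PAYLOAD_BYTES) -> bytes:
    """Decompress at most limit + 1 bytes so a small input cannot expand without bound."""
    decompressor = zlib.decompressobj()
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed payload exceeds limit")
    return out
```

Non-browser clients often send no Origin header at all, so a server that accepts them would need a separate path that relies on token auth rather than this check.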
Abuse
- Spam detection; kick/ban flows; circuit breakers on misbehaving clients
Final Review Checklist
- [ ] Transport choice justified (WS/SSE/etc.)
- [ ] AuthN/Z on connect and per-channel
- [ ] Heartbeats aligned with proxy/LB timeouts
- [ ] Delivery/idempotency/reconnect semantics explicit
- [ ] Horizontal scale path + observability + abuse controls
Tips for Effective Guidance
- ALB idle timeout shorter than the heartbeat interval is a classic production bug; call it out.
- When user says “real-time,” ask latency target and ordering needs.
- SSE is simpler—don’t default to WS for one-way feeds.
Handling Deviations
- Edge runtimes (Workers): connection limits and maximum durations differ; validate against the platform.
- Mobile: background suspension—push notifications may complement WS.