WebSocket Patterns (Deep Workflow)

Realtime connections add stateful complexity: who is connected, what order messages arrive, and what happens when links flap. Design for at-least-once delivery, explicit heartbeats, and horizontal scaling early.

When to Offer This Workflow

Trigger conditions:

- Replacing polling with WS or SSE
- Auth on connect; token refresh mid-session
- Fan-out to many subscribers; presence and typing indicators
- Sticky sessions, load balancer timeouts, reconnect storms

Initial offer:

Use six stages: (1) choose transport, (2) connection & auth, (3) protocol & messages, (4) reliability & ordering, (5) scale & ops, (6) security & abuse. Confirm whether the clients are browsers or servers, and which proxies sit in the path (nginx, ALB, Cloudflare).

Stage 1: Choose Transport

Goal: WebSocket vs SSE vs long polling—right tool per direction.

Heuristics

- Bidirectional, low latency, binary payloads → WebSocket
- Server → client one-way streams, HTTP-friendly infra → SSE
- Fire-and-forget notifications with simple infra → consider push services first
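The heuristics above can be written down as a small decision function, which is handy when recording the "why not alternatives" note for the exit condition. This is only a sketch: the `Needs` fields and the `choose_transport` name are illustrative assumptions, and a real decision also has to weigh the caveats about proxies and intermediaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    # Hypothetical requirement flags, one per heuristic in the list above.
    bidirectional: bool      # client sends as well as receives, at low latency
    binary_payloads: bool    # frames carry binary data
    fire_and_forget: bool    # simple notifications, no session state needed


def choose_transport(needs: Needs) -> str:
    """Map requirement flags to a transport, following the heuristics above."""
    if needs.bidirectional or needs.binary_payloads:
        return "websocket"
    if needs.fire_and_forget:
        return "push service"
    # One-way server-to-client stream over plain HTTP infrastructure.
    return "sse"
```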

Caveats

- Corporate proxies have historically broken WS: test from the target network environments; WSS is mandatory
- HTTP/3 and QUIC stacks differ: validate intermediaries

Exit condition: Transport choice documented with why not alternatives.

Stage 2: Connection & Auth

Goal: Authenticated sockets without long-lived secrets in query strings when avoidable.

Patterns

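- JWT in Sec-WebSocket-Protocol or in the first message after connect; prefer short-lived tokens plus a refresh flow
- Cookie sessions: browsers attach cookies to the handshake automatically, so check Origin and SameSite settings to prevent cross-site WebSocket hijacking
- Re-auth before token expiry; on failure, close gracefully with a close code and reason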

Authorization

- Subscribe to topics only after server-side check—never trust client channel names alone
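The flow of issue token → connect → authorize subscriptions can be sketched with the standard library alone. Note the swap: this uses an HMAC-signed blob as a stand-in for a JWT, not a real JWT library, and `SECRET`, `issue_token`, `verify_token`, and `can_subscribe` are hypothetical names. The point it illustrates is that the topic list comes from server-signed claims, never from the channel name the client sends.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import List, Optional

SECRET = b"demo-secret"  # assumption: a real deployment loads this from a secret store


def issue_token(user: str, topics: List[str], ttl_s: int = 300,
                now: Optional[float] = None) -> str:
    """Issue a short-lived signed token (stand-in for a JWT)."""
    now = time.time() if now is None else now
    claims = {"sub": user, "topics": topics, "exp": now + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired."""
    now = time.time() if now is None else now
    body, _, sig = token.partition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims if claims["exp"] > now else None


def can_subscribe(claims: dict, topic: str) -> bool:
    """Server-side check run on every subscribe request."""
    return topic in claims["topics"]
```

A server would call `verify_token` on connect (or on the first message) and `can_subscribe` for each subscribe request; an expired token is the cue to close with a code and reason so the client can refresh and reconnect.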

Exit condition: Auth diagram: issue token → connect → authorize subscriptions.

Stage 3: Protocol & Messages

Goal: Versioned message schema; predictable errors.

Design

- Envelope: { type, id, ts, payload }; correlation ids for RPC-style
- Version negotiation on connect or feature flags in hello message
- Binary vs JSON: protobuf/msgpack for bandwidth; JSON for debuggability early
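A minimal sketch of the envelope and of version negotiation, using JSON for debuggability. The `reply_to` field is one assumed way to carry the correlation id for RPC-style replies, and `negotiate` assumes both sides advertise integer protocol versions in the hello message; neither detail is fixed by the design above.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Iterable, Optional


@dataclass
class Envelope:
    type: str
    payload: Any
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    ts: float = field(default_factory=time.time)
    reply_to: Optional[str] = None  # hypothetical correlation id for RPC-style replies


def encode(env: Envelope) -> str:
    return json.dumps(asdict(env), separators=(",", ":"))


def decode(raw: str) -> Envelope:
    """Parse and validate a frame; raise ValueError so callers can send a predictable error."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("envelope must be a JSON object")
    missing = {"type", "id", "ts", "payload"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Envelope(type=data["type"], payload=data["payload"], id=data["id"],
                    ts=data["ts"], reply_to=data.get("reply_to"))


def negotiate(client_versions: Iterable[int], server_versions: Iterable[int]) -> Optional[int]:
    """Pick the highest protocol version both sides support, or None."""
    common = set(client_versions) & set(server_versions)
    return max(common) if common else None
```

A reply to a request `req` would then be built as `Envelope(type="rpc.reply", payload=..., reply_to=req.id)`.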

Heartbeats

- Ping/pong or application-level heartbeat at interval < proxy timeout (often 30–60s)
- Idle detection and clean disconnect
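The relationship between heartbeat interval, proxy timeout, and idle detection can be sketched as below. The 0.5 safety margin and the "two missed heartbeats" rule are assumptions chosen for illustration, and time is passed in explicitly so the logic can be tested without a real socket.

```python
from typing import Dict, List


def heartbeat_interval(idle_timeout_s: float, margin: float = 0.5) -> float:
    """Pick an interval safely below the shortest proxy/LB idle timeout."""
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    return idle_timeout_s * margin


class IdleTracker:
    """Tracks the last time each connection was heard from."""

    def __init__(self, heartbeat_s: float, missed_allowed: int = 2):
        # A connection is considered dead after this many silent intervals.
        self.deadline_s = heartbeat_s * missed_allowed
        self.last_seen: Dict[str, float] = {}

    def touch(self, conn_id: str, now: float) -> None:
        """Call on every pong or application-level heartbeat."""
        self.last_seen[conn_id] = now

    def expired(self, now: float) -> List[str]:
        """Connections that should be closed cleanly."""
        return sorted(c for c, t in self.last_seen.items() if now - t > self.deadline_s)

    def drop(self, conn_id: str) -> None:
        self.last_seen.pop(conn_id, None)
```

With a 60-second idle timeout this gives a 30-second heartbeat, so at least one frame crosses the proxy inside every timeout window.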

Exit condition: Protocol doc + example session transcript.

Stage 4: Reliability & Ordering

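Goal: Define delivery semantics. TCP orders bytes within a single connection, but messages can still be lost or replayed across reconnects, so the practical target is usually at-least-once delivery with ordering per channel.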

Practices

- Idempotent message handlers; dedupe by message id when retries exist
- Per-user sequence numbers if strict order matters
- Buffer limits: drop, close, or apply backpressure policy
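Two of the practices above, deduplication by message id and an explicit buffer-limit policy, can be sketched as follows. `Deduper` and `SendBuffer` are hypothetical names; the bounded id window and the two policies shown ("drop_oldest" and "close") are assumptions, and a backpressure policy would instead signal the producer to slow down.

```python
from collections import OrderedDict, deque


class Deduper:
    """Remembers recently seen message ids so redelivered messages are applied once."""

    def __init__(self, max_ids: int = 10_000):
        self._seen = OrderedDict()
        self._max = max_ids

    def first_time(self, msg_id: str) -> bool:
        if msg_id in self._seen:
            return False
        self._seen[msg_id] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # forget the oldest id
        return True


class SendBuffer:
    """Per-connection outbound queue with an explicit overflow policy."""

    def __init__(self, limit: int, policy: str = "drop_oldest"):
        if policy not in ("drop_oldest", "close"):
            raise ValueError(f"unknown policy: {policy}")
        self.limit = limit
        self.policy = policy
        self.queue = deque()
        self.dropped = 0

    def offer(self, msg) -> bool:
        """Queue a message; False means the caller should close the slow connection."""
        if len(self.queue) < self.limit:
            self.queue.append(msg)
            return True
        if self.policy == "drop_oldest":
            self.queue.popleft()
            self.queue.append(msg)
            self.dropped += 1
            return True
        return False
```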

Reconnect

- Exponential backoff + jitter to prevent thundering herd
- Resume from last seen seq if missed messages are unacceptable; persist the stream or fall back to a snapshot
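Both reconnect practices fit in a short sketch. `backoff_delay` uses the "full jitter" variant (a uniform draw up to an exponentially growing ceiling), which is one common choice rather than the only one; `ReplayLog` is a hypothetical in-memory stand-in for whatever persisted log or snapshot store backs resume.

```python
import random
from collections import deque
from typing import List, Optional, Tuple


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0,
                  rng: random.Random = random.Random()) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap_s, base_s * 2 ** min(attempt, 32))  # clamp exponent to avoid overflow
    return rng.uniform(0.0, ceiling)


class ReplayLog:
    """Keeps the last `keep` messages of one channel, numbered from 1."""

    def __init__(self, keep: int = 1000):
        self._items = deque(maxlen=keep)
        self._next_seq = 1

    def append(self, payload) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._items.append((seq, payload))
        return seq

    def since(self, last_seen_seq: int) -> Optional[List[Tuple[int, object]]]:
        """Messages after last_seen_seq, or None if the gap is no longer retained."""
        oldest = self._items[0][0] if self._items else self._next_seq
        if last_seen_seq + 1 < oldest:
            return None  # client must fetch a snapshot instead
        return [(s, p) for s, p in self._items if s > last_seen_seq]
```

On reconnect the client sends its last seen seq; a `None` result is the signal to fall back to a full snapshot.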

Exit condition: Reconnect story documented; storm mitigation tested.

Stage 5: Scale & Operations

Goal: Many connections across many nodes—affinity and pub/sub backbone.

Architecture

- Sticky sessions or shared pub/sub (Redis, NATS, Kafka) for cross-node fan-out
- Shard connection maps; avoid single giant in-memory map on one box
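Cross-node fan-out through a shared backbone can be sketched with an in-memory stand-in. `Bus` plays the role that Redis, NATS, or Kafka would play in production (it is not any of their client APIs), and `Node.outbox` stands in for writes to real sockets. Each node keeps only its own connection map, and every publish goes through the bus so subscribers on other nodes receive it.

```python
from collections import defaultdict
from typing import List, Tuple


class Bus:
    """In-memory stand-in for a shared pub/sub backbone."""

    def __init__(self):
        self._nodes = []

    def attach(self, node: "Node") -> None:
        self._nodes.append(node)

    def publish(self, topic: str, msg: str) -> None:
        for node in self._nodes:
            node.deliver_local(topic, msg)


class Node:
    """One server process holding a shard of the connections."""

    def __init__(self, name: str, bus: Bus):
        self.name = name
        self.bus = bus
        self.subs = defaultdict(set)                 # topic -> local connection ids
        self.outbox: List[Tuple[str, str]] = []      # (conn_id, msg) that would be written to sockets
        bus.attach(self)

    def subscribe(self, conn_id: str, topic: str) -> None:
        self.subs[topic].add(conn_id)

    def publish(self, topic: str, msg: str) -> None:
        # Always go through the bus, even for local subscribers.
        self.bus.publish(topic, msg)

    def deliver_local(self, topic: str, msg: str) -> None:
        for conn_id in sorted(self.subs.get(topic, ())):
            self.outbox.append((conn_id, msg))
```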

Observability

- Metrics: active connections, msg/sec, queue depth, disconnect reasons
- Tracing: connect → subscribe → first message latency

Load shedding

- Max connections per IP/user; rate limit connection attempts
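A sketch of both load-shedding limits in one gate: a token bucket throttles connection attempts per key (IP or user), and a counter caps concurrent connections. `ConnectionGate` and its parameters are hypothetical, and the clock is passed in so the behaviour is deterministic.

```python
from collections import defaultdict
from typing import Dict, Tuple


class ConnectionGate:
    def __init__(self, max_per_key: int, attempts_per_s: float, burst: int):
        self.max_per_key = max_per_key
        self.rate = attempts_per_s
        self.burst = burst
        self._open = defaultdict(int)                     # key -> open connections
        self._bucket: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last refill time)

    def try_connect(self, key: str, now: float) -> bool:
        """Return True if the connection attempt is admitted."""
        tokens, last = self._bucket.get(key, (float(self.burst), now))
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens < 1:
            self._bucket[key] = (tokens, now)
            return False                      # too many attempts: rate limited
        self._bucket[key] = (tokens - 1, now)  # every attempt costs a token
        if self._open[key] >= self.max_per_key:
            return False                      # concurrent connection cap reached
        self._open[key] += 1
        return True

    def disconnect(self, key: str) -> None:
        if self._open[key] > 0:
            self._open[key] -= 1
```

Charging a token even for rejected attempts means a client hammering the cap also exhausts its attempt budget, which dampens reconnect storms.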

Exit condition: Capacity model: connections per node × message fan-out cost.
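The capacity model can be made concrete with a few lines of arithmetic. The function below is a rough sketch that assumes a node is limited either by how many connections it can hold or by how many outbound messages it can send per second; all input numbers in the usage note are invented for illustration and should be replaced with measured values.

```python
import math


def nodes_needed(total_connections: int, conns_per_node: int,
                 publishes_per_s: float, avg_fanout: float,
                 node_sends_per_s: float) -> int:
    """Nodes required by the tighter of the connection and throughput limits."""
    outbound_per_s = publishes_per_s * avg_fanout   # each publish is sent to avg_fanout subscribers
    by_connections = math.ceil(total_connections / conns_per_node)
    by_throughput = math.ceil(outbound_per_s / node_sends_per_s)
    return max(by_connections, by_throughput)
```

For example, 200,000 connections at 50,000 per node needs 4 nodes, while 500 publishes/s with a fan-out of 200 is 100,000 sends/s, which at 40,000 sends/s per node needs 3; the connection limit dominates and the answer is 4.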

Stage 6: Security & Abuse

Goal: Minimize attack surface on long-lived pipes.

Controls

- WSS everywhere; validate Origin where applicable
- Payload size limits; compression bomb awareness
- AuthZ on every subscription; audit admin actions
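Two of these controls lend themselves to a short sketch: an Origin allow-list check at handshake time and a bounded decompression step that defuses compression bombs. `ALLOWED_ORIGINS` and `MAX_MESSAGE_BYTES` are hypothetical values, rejecting a missing Origin is an assumed policy that suits browser-only endpoints, and `zlib` is used here as a stand-in for whatever compression the WebSocket stack actually negotiates.

```python
import zlib
from typing import Optional
from urllib.parse import urlsplit

ALLOWED_ORIGINS = {"https://app.example.com"}  # hypothetical allow-list
MAX_MESSAGE_BYTES = 64 * 1024                  # hypothetical payload limit


def origin_allowed(origin_header: Optional[str]) -> bool:
    """Check the handshake's Origin header against the allow-list."""
    if not origin_header:
        return False  # assumption: non-browser clients use a different endpoint
    parts = urlsplit(origin_header)
    normalized = f"{parts.scheme}://{parts.netloc}".lower()
    return normalized in ALLOWED_ORIGINS


def inflate_bounded(data: bytes, limit: int = MAX_MESSAGE_BYTES) -> bytes:
    """Decompress at most limit bytes; reject anything that would expand further."""
    decompressor = zlib.decompressobj()
    out = decompressor.decompress(data, limit + 1)  # never materialize more than limit + 1 bytes
    if len(out) > limit:
        raise ValueError("decompressed payload exceeds limit")
    return out
```

The key detail is the `max_length` argument: the size check happens while inflating, not after the whole payload has been expanded in memory.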

Abuse

- Spam detection; kick/ban flows; circuit breakers on misbehaving clients

Final Review Checklist

- [ ] Transport choice justified (WS/SSE/etc.)
- [ ] AuthN/Z on connect and per-channel
- [ ] Heartbeats aligned with proxy/LB timeouts
- [ ] Delivery/idempotency/reconnect semantics explicit
- [ ] Horizontal scale path + observability + abuse controls

Tips for Effective Guidance

- ALB idle timeout vs heartbeat—classic production bug; call it out.
- When the user says "real-time," ask for the latency target and ordering needs.
- SSE is simpler; don't default to WS for one-way feeds.

Handling Deviations

- Edge runtimes (Workers): different connection limits and duration—validate platform.
- Mobile: the OS suspends backgrounded apps and drops their sockets; push notifications may complement WS.
