System Design (Deep Workflow)
System design is structured decision-making under constraints. The output is not a diagram—it is clarity on requirements, explicit trade-offs, and a path to evolve when load and features change.
When to Offer This Workflow
Trigger conditions:
- “Design Twitter/Instagram/WhatsApp” (interview style)
- Greenfield service, major scale milestone, multi-region, or realtime needs
- Refactoring a monolith: questions of service boundaries and data ownership
Initial offer:
Use seven stages: (1) clarify requirements, (2) capacity & SLO sketch, (3) high-level architecture, (4) data model & storage, (5) APIs & traffic patterns, (6) reliability & failure modes, (7) trade-offs & evolution. Ask whether this is interview mode (time-boxed) or a real project (full depth).
Stage 1: Clarify Requirements
Goal: Make functional and non-functional requirements explicit.
Functional
- Core user actions; read vs write ratio; search, ranking, notifications?
Non-functional
- Scale: DAU, QPS, data size, growth (orders of magnitude are fine if unknown)
- Latency: p95/p99 targets; is async acceptable, or must responses be synchronous?
- Consistency: can reads be stale? global ordering needed?
- Durability: loss tolerance; audit; compliance
Out of Scope
- Explicitly list non-goals to prevent scope creep in interviews and real life
Exit condition: A one-paragraph problem statement and a bullet list of constraints.
Stage 2: Capacity & SLO Sketch
Goal: Back-of-envelope math to sanity-check bottlenecks.
Rough math
- Requests/day → average QPS (divide by 86,400 seconds); apply a 3–10× factor for peak QPS if needed
- Storage/day; replication multiplier
- Bandwidth for large payloads (images, video)
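The rough math above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and every input number is a hypothetical, deliberately round value rather than a measurement.

```python
SECONDS_PER_DAY = 86_400


def average_qps(requests_per_day: float) -> float:
    """Average queries per second over a full day."""
    return requests_per_day / SECONDS_PER_DAY


def peak_qps(requests_per_day: float, peak_factor: float = 5.0) -> float:
    """Peak QPS, assuming peak traffic is peak_factor times the average (3-10x is typical)."""
    return average_qps(requests_per_day) * peak_factor


def storage_per_day_bytes(writes_per_day: float, bytes_per_write: float,
                          replication: int = 3) -> float:
    """Raw bytes written per day, multiplied by the replication factor."""
    return writes_per_day * bytes_per_write * replication


def bandwidth_bytes_per_sec(qps: float, payload_bytes: float) -> float:
    """Bandwidth needed to serve payloads of a given size at a given QPS."""
    return qps * payload_bytes


# Hypothetical inputs, chosen so the results are round numbers.
avg = average_qps(864_000_000)                             # 10,000 QPS on average
peak = peak_qps(864_000_000, peak_factor=5.0)              # 50,000 QPS at peak
storage = storage_per_day_bytes(10_000_000, 1_000, 3)      # 30 GB/day with 3 replicas
image_bandwidth = bandwidth_bytes_per_sec(100, 200_000)    # 20 MB/s for 200 KB images
```

The point of the exercise is the order of magnitude, not the exact figure: 50,000 QPS and 30 GB/day suggest very different bottlenecks than 50 QPS and 30 MB/day.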
SLO mindset
- Availability vs cost; strong consistency vs latency
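One way to make the availability-vs-cost trade-off concrete is to turn an availability target into a downtime budget. The sketch below assumes a 30-day window and assumes that serial dependencies fail independently; both are simplifications, and the function names are hypothetical.

```python
import math
from typing import Iterable

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime budget implied by an availability target over the window."""
    return (1.0 - availability) * window_minutes


def serial_availability(dependencies: Iterable[float]) -> float:
    """Availability of a request path that needs every dependency to be up.

    Assumes failures are independent, which is optimistic in practice.
    """
    return math.prod(dependencies)


budget_three_nines = allowed_downtime_minutes(0.999)    # about 43 minutes per 30 days
budget_four_nines = allowed_downtime_minutes(0.9999)    # about 4 minutes per 30 days
two_hops = serial_availability([0.999, 0.999])          # about 0.998
```

Each extra "nine" cuts the budget by 10×, and every serial dependency on the critical path spends some of it.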
Exit condition: Identified likely bottleneck class: DB, network, fan-out, storage.
Stage 3: High-Level Architecture
Goal: Boxes and arrows with reasons.
Typical layers
- Clients → LB/API → services → caches/queues → databases/object storage
- CDN for static and cacheable API responses when applicable
- Async processing for heavy work (indexing, emails, ML)
Principles
- Separation of read/write (CQRS) only when justified by scale
- Idempotent workers; at-least-once messaging assumptions
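A minimal sketch of an idempotent worker under at-least-once delivery, assuming each message carries a unique ID assigned by the producer. The `Message` and `IdempotentWorker` names are hypothetical, and the in-memory set stands in for what would normally be a durable store.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class Message:
    message_id: str   # unique per logical operation, assigned by the producer
    account: str
    amount: int


@dataclass
class IdempotentWorker:
    balances: Dict[str, int] = field(default_factory=dict)
    processed_ids: Set[str] = field(default_factory=set)

    def handle(self, msg: Message) -> bool:
        """Apply the message once; return False if it was a duplicate delivery."""
        if msg.message_id in self.processed_ids:
            return False  # redelivery: the effect was already applied
        self.balances[msg.account] = self.balances.get(msg.account, 0) + msg.amount
        # In a real system the state change and this record must commit atomically.
        self.processed_ids.add(msg.message_id)
        return True


worker = IdempotentWorker()
deposit = Message("msg-1", "alice", 100)
results = [worker.handle(deposit), worker.handle(deposit)]  # broker redelivers msg-1
```

The second delivery is acknowledged but has no effect, so retries by the broker are safe.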
Exit condition: A diagram, plus a one-paragraph answer to "why not something simpler (a monolith)?"
Stage 4: Data Model & Storage
Goal: Choose stores for access patterns, not buzzwords.
Questions
- Relational vs document vs wide-column vs graph: start from query patterns
- Sharding key at very large scale; risk of hot partitions
- Caching: what, TTL, invalidation
- Search: inverted index service (Elasticsearch, etc.) vs DB full-text
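To illustrate the sharding-key and hot-partition question, here is a small sketch of hash-based shard assignment. The key format, the shard count, and the traffic mix are all hypothetical; the hash uses `hashlib` so that assignment is stable across processes.

```python
import hashlib
from collections import Counter
from typing import Iterable


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (unlike Python's built-in hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_load(keys: Iterable[str], num_shards: int) -> Counter:
    """Count how many requests land on each shard."""
    return Counter(shard_for(k, num_shards) for k in keys)


# Hypothetical traffic: one "celebrity" key receives 90% of requests.
traffic = ["user:celebrity"] * 900 + [f"user:{i}" for i in range(100)]
load = shard_load(traffic, num_shards=4)
hottest_share = max(load.values()) / sum(load.values())  # at least 0.9
```

Hashing spreads distinct keys evenly, but it cannot split a single hot key: all 900 requests still land on one shard. That case needs a different tactic, such as caching the hot item or splitting the key.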
Consistency
- Transaction boundaries; sagas for cross-service consistency; eventual consistency where acceptable
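As a rough sketch of the saga idea: each step has a compensating action, and when a step fails the already-completed steps are undone in reverse order. The order workflow, step names, and `run_saga` helper below are hypothetical and run in a single process; a real saga would span services and persist its progress.

```python
from typing import Callable, Dict, List, Tuple

# (name, action, compensation); all three are illustrative for this sketch.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


state: Dict[str, bool] = {"inventory_reserved": False, "payment_charged": False}


def reserve_inventory() -> None:
    state["inventory_reserved"] = True


def release_inventory() -> None:
    state["inventory_reserved"] = False


def charge_payment() -> None:
    state["payment_charged"] = True


def refund_payment() -> None:
    state["payment_charged"] = False


def ship_order() -> None:
    raise RuntimeError("carrier unavailable")  # simulated downstream failure


def cancel_shipment() -> None:
    pass


ok, log = run_saga([
    ("reserve_inventory", reserve_inventory, release_inventory),
    ("charge_payment", charge_payment, refund_payment),
    ("ship_order", ship_order, cancel_shipment),
])
```

Between the failure and the last compensation, other readers can observe intermediate state; that window is the eventual-consistency cost of using a saga instead of a distributed transaction.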
Exit condition: Schema sketch or entity list; read/write paths for top 3 operations.
Stage 5: APIs & Traffic Patterns
Goal: Interface design and operational behavior.
REST vs RPC vs GraphQL
- Trade-offs: coupling, overfetching, caching, team boundaries
Realtime
- WebSockets/SSE; presence; ordering; backpressure
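Backpressure can be illustrated with a bounded buffer that rejects new work when full, instead of growing without limit. The sketch uses the standard-library `queue.Queue`; the `offer` helper and the buffer size of 2 are assumptions made for the example.

```python
import queue


def offer(buffer: "queue.Queue", item: object) -> bool:
    """Try to enqueue without blocking; return False when the buffer is full.

    A False result is the backpressure signal: the caller can drop the
    message, slow the producer, or tell the client to retry later.
    """
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False


outbound = queue.Queue(maxsize=2)               # per-connection send buffer
accepted = [offer(outbound, i) for i in range(3)]   # third offer is rejected
```

Which reaction fits depends on the product: dropping is fine for presence updates, while chat messages usually need the producer to slow down or persist the message elsewhere.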
Rate limiting & auth
- Gateway enforcement; user vs service identity
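A token bucket is one common way a gateway can enforce rate limits per identity. The sketch below keeps one bucket per caller ID so that users and services are limited independently; the class names and identity strings are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Dict


class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity        # start full so short bursts are allowed
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend tokens if enough are available."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """One bucket per identity (for example "user:1" or "service:billing")."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, identity: str, now: float) -> bool:
        if identity not in self._buckets:
            self._buckets[identity] = TokenBucket(self.capacity, self.refill_per_sec, now)
        return self._buckets[identity].allow(now)


limiter = RateLimiter(capacity=2, refill_per_sec=1.0)
burst = [limiter.allow("user:1", now=0.0) for _ in range(3)]  # third call is throttled
```

In a multi-node gateway the bucket state would have to live in a shared store; this single-process version only shows the accounting.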
Exit condition: Example APIs or events for core flows; pagination strategy.
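As one possible pagination strategy, the sketch below shows cursor-based (keyset) pagination over items sorted by a unique, increasing `id`. The cursor format (base64-encoded JSON with a `last_id` field) and the function names are assumptions for illustration; the list filter stands in for a `WHERE id > :last_id ORDER BY id LIMIT :n` query.

```python
import base64
import json
from typing import Dict, List, Optional, Tuple


def encode_cursor(last_id: int) -> str:
    """Opaque cursor: clients pass it back unchanged."""
    raw = json.dumps({"last_id": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> int:
    raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return json.loads(raw)["last_id"]


def list_page(items: List[Dict], limit: int,
              cursor: Optional[str] = None) -> Tuple[List[Dict], Optional[str]]:
    """Return one page and the cursor for the next page (None when exhausted).

    Assumes items are sorted ascending by a unique integer "id".
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    after = decode_cursor(cursor) if cursor else None
    remaining = [it for it in items if after is None or it["id"] > after]
    page = remaining[:limit]
    next_cursor = encode_cursor(page[-1]["id"]) if len(remaining) > limit else None
    return page, next_cursor


posts = [{"id": i} for i in range(1, 6)]
first_page, cursor_1 = list_page(posts, limit=2)
second_page, cursor_2 = list_page(posts, limit=2, cursor=cursor_1)
last_page, cursor_3 = list_page(posts, limit=2, cursor=cursor_2)
```

Unlike offset pagination, a cursor stays stable when new rows are inserted ahead of the reader, at the cost of not supporting "jump to page N".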
Stage 6: Reliability & Failure Modes
Goal: Failure is normal; design for graceful degradation.
Consider
- Retries with backoff; timeouts everywhere; circuit breakers
- Partial outages: read-only mode, stale cache, queue backlog
- Disaster: backup/restore, multi-region (active-active vs DR)
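The retry and circuit-breaker items above can be sketched as follows. This is a simplified, single-threaded illustration: the function and class names, thresholds, and delays are assumptions, and `sleep` and `rng` are injectable only so the example is deterministic.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_attempts: int = 4,
                      base_delay: float = 0.1, max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Retry fn with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # jitter spreads out retries from many clients
    raise AssertionError("unreachable")


class CircuitBreaker:
    """Open after consecutive failures; allow a trial call after reset_after seconds."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        return now - self.opened_at >= self.reset_after  # half-open trial

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open (or re-open) the circuit
```

Retries only help with transient faults and should wrap idempotent operations; the breaker covers the opposite case, where continuing to call a failing dependency would make the outage worse.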
Observability
- Metrics, logs, traces; SLOs for critical paths
Exit condition: Top 5 failure scenarios, each with a mitigation.
Stage 7: Trade-offs & Evolution
Goal: Show maturity—v1 vs v2 path.
Articulate
- What you build first vs later; feature flags; strangler patterns
- Interview: summarize bottleneck and future scaling in 60 seconds
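Feature flags and strangler-style migrations both rely on routing a controlled share of traffic to the new path. A minimal sketch of a deterministic percentage rollout, assuming a hypothetical flag name and string user IDs:

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Stable bucket in [0, 100) derived from the flag name and user ID."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, percent: int) -> bool:
    """True when the user falls inside the rollout percentage for this flag."""
    return rollout_bucket(flag, user_id) < percent


def handle_request(user_id: str, percent: int) -> str:
    """Route to the v2 path for the rollout cohort, v1 for everyone else."""
    return "v2" if is_enabled("new_feed", user_id, percent) else "v1"
```

Because the bucket depends only on the flag and the user, a user who sees v2 at 10% keeps seeing it at 50%, and setting the percentage to 0 is an immediate rollback.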
Final Review Checklist
- [ ] Requirements and non-goals clear
- [ ] Rough capacity points to bottleneck
- [ ] Architecture justified vs simpler alternatives
- [ ] Data stores match access patterns + consistency needs
- [ ] APIs/events and failure modes addressed
- [ ] Evolution path stated
Tips for Effective Guidance
- Interview: time-box depth. Go breadth first, then zoom into one area on request.
- Always mention hot keys, fan-out, and backpressure for scale.
- Call out the exactly-once myth: in practice it is usually at-least-once delivery plus idempotency.
Handling Deviations
- Small system: still run the stages lightly; the habit prevents over-engineering later.
- Existing system: focus on incremental changes and data migration risks.