Postmortems
A good postmortem extracts lessons from an incident without blaming individuals. It produces owned action items that reduce recurrence or improve detection, not generic platitudes like “communicate better.”
When to Offer This Workflow
Trigger conditions:
- SEV incidents, customer-visible outages, data-loss scares
- Near-misses worth documenting
- Need facilitation structure in a blame-prone culture
Initial offer:
Use six stages: (1) scope & audience, (2) timeline & impact, (3) root cause analysis, (4) what worked / didn’t, (5) action items, (6) communication & follow-up. Confirm whether the summary is internal-only or customer-facing.
Stage 1: Scope & Audience
Goal: Define readers (exec, engineering, CS) and redact PII or sensitive security details.
Practices
- Blameless framing in the invite and template
Exit condition: Template chosen; owner assigned for the final document.
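The redaction step in this stage can be partly automated before a human review. The following is a minimal sketch, assuming that email addresses and IPv4 addresses are the sensitive details in question; the patterns and the `redact` helper are hypothetical, and a real pass should follow your own data-handling policy.

```python
import re

# Hypothetical patterns for illustration; extend per your own policy.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    text = IPV4.sub("[REDACTED_IP]", text)
    return text
```

A pattern pass like this only catches well-formed identifiers, so a manual read-through is still needed before a customer-facing summary goes out.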
Stage 2: Timeline & Impact
Goal: Build a minute-resolution timeline in UTC: onset → detection → mitigation → resolution.
Impact
- Users affected, duration, data integrity if relevant, SLA breach
Exit condition: Facts align with any external customer communication.
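Once the four timeline events are recorded in UTC, the durations for the impact section follow by subtraction. The sketch below assumes timestamps written as `YYYY-MM-DD HH:MM` and assumes onset precedes detection; the function names and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM' as a UTC timestamp (minute resolution)."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


def incident_durations(onset: str, detected: str, mitigated: str, resolved: str) -> dict:
    """Return minutes from onset to detection, mitigation, and resolution."""
    times = [parse_utc(s) for s in (onset, detected, mitigated, resolved)]
    if times != sorted(times):
        raise ValueError("timeline events are out of order")
    t_onset, t_detected, t_mitigated, t_resolved = times
    return {
        "time_to_detect_min": minutes_between(t_onset, t_detected),
        "time_to_mitigate_min": minutes_between(t_onset, t_mitigated),
        "total_duration_min": minutes_between(t_onset, t_resolved),
    }


# Made-up timestamps for illustration only.
durations = incident_durations(
    "2024-01-15 09:02", "2024-01-15 09:14", "2024-01-15 09:41", "2024-01-15 10:30"
)
```

The ordering check is a cheap way to catch transcription errors, such as a detection time logged before onset, before the numbers reach a customer communication.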
Stage 3: Root Cause Analysis
Goal: Use five whys or fishbone as tools, not rituals. Separate the root cause (the condition whose fix stops the whole class of failure) from contributing factors (process gaps, missing tests).
Practices
- Do not name an individual as the “root cause”
Exit condition: Evidence-backed causal chain; contributing factors listed.
Stage 4: What Worked / Didn’t
Goal: Reinforce what worked (runbooks followed, clear comms) and name what didn’t (missing dashboards, slow escalation).
Stage 5: Action Items
Goal: Specific tickets with owners and dates; categorize prevent / detect / recover / process.
Practices
- Avoid vague “add monitoring”; name the metrics or signals
Exit condition: Items linked in the issue tracker.
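The requirements in this stage (owner, date, category, a specific title, a tracker link) can be checked mechanically before the document is published. This is a rough sketch under stated assumptions: the `ActionItem` fields, the `VAGUE_PHRASES` list, and the tracker URL are hypothetical and not tied to any particular issue tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# The four categories named in this stage.
CATEGORIES = {"prevent", "detect", "recover", "process"}
# Hypothetical examples of titles too vague to act on.
VAGUE_PHRASES = ("add monitoring", "communicate better", "be more careful")


@dataclass
class ActionItem:
    title: str
    owner: str
    due: Optional[date]
    category: str
    ticket_url: str = ""


def problems(item: ActionItem) -> List[str]:
    """List the reasons an action item is not ready to publish."""
    issues = []
    if not item.owner.strip():
        issues.append("missing owner")
    if item.due is None:
        issues.append("missing due date")
    if item.category not in CATEGORIES:
        issues.append(f"unknown category: {item.category}")
    if item.title.strip().lower() in VAGUE_PHRASES:
        issues.append("title too vague; name the metric or signal")
    if not item.ticket_url:
        issues.append("not linked in the issue tracker")
    return issues
```

For example, an item titled "Alert on checkout p99 latency > 2s for 5 min" with an owner, a due date, the `detect` category, and a ticket link returns an empty list, while a bare "Add monitoring" with no owner fails several checks at once.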
Stage 6: Communication & Follow-Up
Goal: Share the summary internally; publish an external postmortem only when policy requires it; review action-item completion at 30 and 60 days.
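The follow-up checkpoints can be computed when the postmortem is published so that the reviews land on a calendar. A small sketch, assuming the 30 and 60 day marks are counted in calendar days from the publication date; `follow_up_dates` is a hypothetical helper name.

```python
from datetime import date, timedelta
from typing import List, Sequence


def follow_up_dates(published: date, offsets_days: Sequence[int] = (30, 60)) -> List[date]:
    """Return the review dates that fall the given number of days after publication."""
    return [published + timedelta(days=d) for d in offsets_days]


# Illustrative publication date only.
reviews = follow_up_dates(date(2024, 1, 15))
```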
Final Review Checklist
- [ ] Blameless tone; timeline and facts accurate
- [ ] Impact quantified where possible
- [ ] Root cause vs contributing factors distinguished
- [ ] Action items owned, dated, tracked
- [ ] Follow-up review scheduled
Tips for Effective Guidance
- Match depth to severity; lightweight retro for minor incidents.
- Link traces, metrics, and logs in an appendix for engineers.
- Psychological safety enables honesty—leadership models it.
Handling Deviations
- Security incidents: coordinate with legal/infosec before publishing any details.