Postmortems (Deep Workflow)
A good postmortem learns without blaming individuals. It produces owned actions that reduce recurrence or improve detection—not a generic “we will communicate better.”
When to Offer This Workflow
Trigger conditions:
- SEV incidents, customer-visible outages, data loss scares
- Near-miss worth documenting (luck prevented impact)
- Risk of a blame culture, where the discussion needs a facilitation structure
Initial offer:
Use six stages: (1) scope & audience, (2) timeline & impact, (3) root cause analysis, (4) what worked / didn’t, (5) action items, (6) communication & follow-up. Confirm whether the summary is internal-only or customer-facing.
Stage 1: Scope & Audience
Goal: Identify the readers (exec, eng, CS) and the sensitivity level (PII and security details redacted).
Practices
- Blameless framing in invite and template
Exit condition: Template chosen; owner assigned for the final doc.
Stage 2: Timeline & Impact
Goal: Build a minute-resolution timeline in UTC that distinguishes when the incident started, when it was detected, when it was mitigated, and when it was resolved.
Impact
- Users affected, duration, data integrity if relevant, SLA breach
Exit condition: Customer communication aligned with facts here.
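The four timestamps in the goal above can be turned into durations once they are recorded consistently. The following is a minimal sketch, not a prescribed format: the `IncidentTimeline` class, its field names, and the choice to measure time-to-mitigate from detection (rather than from start) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict


@dataclass(frozen=True)
class IncidentTimeline:
    """Four key timestamps for one incident (hypothetical structure)."""
    started: datetime
    detected: datetime
    mitigated: datetime
    resolved: datetime

    def __post_init__(self) -> None:
        stamps = [self.started, self.detected, self.mitigated, self.resolved]
        for ts in stamps:
            # Naive timestamps return None here, so they are rejected too.
            if ts.utcoffset() != timedelta(0):
                raise ValueError(f"timestamp {ts!r} is not in UTC")
        if stamps != sorted(stamps):
            raise ValueError("expected started <= detected <= mitigated <= resolved")

    def durations(self) -> Dict[str, timedelta]:
        """Return the intervals a reader usually asks about first."""
        return {
            # Assumption: detection is measured from the true start.
            "time_to_detect": self.detected - self.started,
            # Assumption: mitigation is measured from detection.
            "time_to_mitigate": self.mitigated - self.detected,
            "time_to_resolve": self.resolved - self.started,
        }


if __name__ == "__main__":
    utc = timezone.utc
    # Made-up timestamps, for illustration only.
    timeline = IncidentTimeline(
        started=datetime(2024, 1, 1, 10, 0, tzinfo=utc),
        detected=datetime(2024, 1, 1, 10, 12, tzinfo=utc),
        mitigated=datetime(2024, 1, 1, 10, 40, tzinfo=utc),
        resolved=datetime(2024, 1, 1, 11, 30, tzinfo=utc),
    )
    for name, value in timeline.durations().items():
        print(name, value)
```

Rejecting naive or non-UTC timestamps at construction time keeps mixed-timezone entries out of the timeline before anyone computes impact from it.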
Stage 3: Root Cause Analysis
Goal: Use five whys or a fishbone diagram as a tool, not a ritual; keep the root cause separate from contributing factors.
Practices
- Root: the cause whose fix stops this class of recurrence (with evidence)
- Contributors: process, missing tests, alert gaps
Exit condition: No single person named as “root cause.”
Stage 4: What Worked / Didn’t
Goal: Reinforce good (runbooks, heroes who followed process) and fix bad (missing dashboards).
Stage 5: Action Items
Goal: Specific, tracked tickets with owners and dates; types: prevent, detect, recover, process.
Practices
- Avoid vague “add monitoring” without metric names
Exit condition: Every action item is linked to a ticket in the issue tracker.
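The Stage 5 criteria (owner, date, tracker link, one of four types, no unnamed monitoring) are mechanical enough to check before the doc is finalized. This is a rough sketch under stated assumptions: `ActionItem`, its fields, and `lint_action_item` are hypothetical names, and the "monitoring" check is a simple substring heuristic rather than a real measure of vagueness.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# The four action item types named in the Stage 5 goal.
ACTION_TYPES = {"prevent", "detect", "recover", "process"}


@dataclass
class ActionItem:
    """One postmortem action item (hypothetical structure)."""
    title: str
    kind: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket_url: Optional[str] = None
    metric_name: Optional[str] = None  # expected when the item adds monitoring


def lint_action_item(item: ActionItem) -> List[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if item.kind not in ACTION_TYPES:
        problems.append(f"unknown type {item.kind!r}")
    if not item.owner:
        problems.append("missing owner")
    if item.due is None:
        problems.append("missing due date")
    if not item.ticket_url:
        problems.append("missing tracker link")
    # Heuristic only: flag monitoring work that does not name a metric.
    if "monitoring" in item.title.lower() and not item.metric_name:
        problems.append("monitoring item does not name a metric")
    return problems


if __name__ == "__main__":
    vague = ActionItem(title="Add monitoring", kind="detect")
    for problem in lint_action_item(vague):
        print(problem)
```

A check like this only catches missing fields; whether an item actually reduces recurrence still needs a human review.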
Stage 6: Communication & Follow-Up
Goal: Share summary with org; review completion in 30/60 days.
Practices
- Publish an external postmortem if a customer commitment requires it
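The 30/60-day review from the Stage 6 goal is easier to keep if the dates are computed when the summary goes out. A minimal sketch follows, assuming the clock starts on the day the summary is shared (the text does not say which date to count from); `follow_up_dates` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import List, Sequence


def follow_up_dates(shared_on: date, offsets_days: Sequence[int] = (30, 60)) -> List[date]:
    """Return the review dates, counted in calendar days from shared_on."""
    return [shared_on + timedelta(days=n) for n in offsets_days]


if __name__ == "__main__":
    # Made-up date, for illustration only.
    for review in follow_up_dates(date(2024, 1, 1)):
        print(review.isoformat())
```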
Final Review Checklist
- [ ] Blameless tone; facts and timeline clear
- [ ] Impact quantified where possible
- [ ] Root cause and contributing factors distinguished
- [ ] Action items owned, dated, and tracked
- [ ] Follow-up review scheduled
Tips for Effective Guidance
- Depth of the postmortem should match severity (lightweight for small incidents).
- Link to metrics and traces in appendix for engineers.
- Psychological safety enables honesty—leadership must model it.
Handling Deviations
- Security incident: coordinate with legal before public detail.
- Repeated same failure: escalate to architecture or SLO review.