Incident Commander Skill

Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026

Overview

The Incident Commander skill provides an incident response framework for managing technology incidents from detection through resolution and post-incident review. It draws on established SRE and DevOps practices and provides structured tools for severity classification, timeline reconstruction, and post-incident analysis.

Key Features

- Automated Severity Classification: Intelligent incident triage based on impact and urgency metrics
- Timeline Reconstruction: Transform scattered logs and events into coherent incident narratives
- Post-Incident Review Generation: Structured PIRs with multiple RCA frameworks
- Communication Templates: Pre-built templates for stakeholder updates and escalations
- Runbook Integration: Generate actionable runbooks from incident patterns

Skills Included

Core Tools

1. Incident Classifier (incident_classifier.py)

- Analyzes incident descriptions and outputs severity levels
- Recommends response teams and initial actions
- Generates communication templates based on severity

2. Timeline Reconstructor (timeline_reconstructor.py)

- Processes timestamped events from multiple sources
- Reconstructs chronological incident timeline
- Identifies gaps and provides duration analysis

3. PIR Generator (pir_generator.py)

- Creates comprehensive Post-Incident Review documents
- Applies multiple RCA frameworks (5 Whys, Fishbone, Timeline)
- Generates actionable follow-up items
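The Timeline Reconstructor's described behavior (merging timestamped events from several sources, ordering them, and flagging gaps) can be sketched roughly as follows. This is a minimal illustration, not the contents of timeline_reconstructor.py: the event field names (`timestamp`, `source`, `message`), the 15-minute gap threshold, the function names, and the sample events are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def reconstruct_timeline(*sources):
    """Merge event lists from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    # Parse the ISO 8601 strings so ordering is by time, not by text.
    return sorted(merged, key=lambda e: datetime.fromisoformat(e["timestamp"]))

def find_gaps(timeline, threshold=timedelta(minutes=15)):
    """Return (start, end, duration) for consecutive events further apart than threshold."""
    gaps = []
    for earlier, later in zip(timeline, timeline[1:]):
        start = datetime.fromisoformat(earlier["timestamp"])
        end = datetime.fromisoformat(later["timestamp"])
        if end - start > threshold:
            gaps.append((start, end, end - start))
    return gaps

# Hypothetical sample events from two sources.
alerts = [
    {"timestamp": "2026-02-01T10:00:00", "source": "pagerduty", "message": "High error rate"},
]
chat = [
    {"timestamp": "2026-02-01T10:40:00", "source": "slack", "message": "Rollback complete"},
    {"timestamp": "2026-02-01T10:05:00", "source": "slack", "message": "IC assigned"},
]

timeline = reconstruct_timeline(alerts, chat)
gaps = find_gaps(timeline)
# Duration analysis: time between the first and last recorded events.
total_duration = (datetime.fromisoformat(timeline[-1]["timestamp"])
                  - datetime.fromisoformat(timeline[0]["timestamp"]))
```

Here the only gap longer than the assumed threshold is the 35 minutes between "IC assigned" and "Rollback complete", which is the kind of undocumented stretch a reviewer would want to fill in before writing the PIR.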

Incident Response Framework

Severity Classification System

SEV1 - Critical Outage

Definition: Complete service failure affecting all users or critical business functions

Characteristics:

- Customer-facing services completely unavailable
- Data loss or corruption affecting users
- Security breaches with customer data exposure
- Revenue-generating systems down
- SLA violations with financial penalties

Response Requirements:

- Immediate escalation to on-call engineer
- Incident Commander assigned within 5 minutes
- Executive notification within 15 minutes
- Public status page update within 15 minutes
- War room established
- All hands on deck if needed

Communication Frequency: Every 15 minutes until resolution

SEV2 - Major Impact

Definition: Significant degradation affecting subset of users or non-critical functions

Characteristics:

- Partial service degradation (>25% of users affected)
- Performance issues causing user frustration
- Non-critical features unavailable
- Internal tools impacting productivity
- Data inconsistencies not affecting user experience

Response Requirements:

- On-call engineer response within 15 minutes
- Incident Commander assigned within 30 minutes
- Status page update within 30 minutes
- Stakeholder notification within 1 hour
- Regular team updates

Communication Frequency: Every 30 minutes during active response

SEV3 - Minor Impact

Definition: Limited impact with workarounds available

Characteristics:

- Single feature or component affected
- <25% of users impacted
- Workarounds available
- Performance degradation not significantly impacting UX
- Non-urgent monitoring alerts

Response Requirements:

- Response within 2 hours during business hours
- Next business day response acceptable outside hours
- Internal team notification
- Optional status page update

Communication Frequency: At key milestones only

SEV4 - Low Impact

Definition: Minimal impact, cosmetic issues, or planned maintenance

Characteristics:

- Cosmetic bugs
- Documentation issues
- Logging or monitoring gaps
- Performance issues with no user impact
- Development/test environment issues

Response Requirements:

- Response within 1-2 business days
- Standard ticket/issue tracking
- No special escalation required

Communication Frequency: Standard development cycle updates
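The thresholds above can be expressed as a small decision function. The sketch below is illustrative only and is not the logic of incident_classifier.py: the `Impact` fields and the `classify_severity` name are assumptions, as is the handling of cases the definitions leave open (exactly 25% of users affected is treated as SEV3, and any nonzero user impact at or below 25% is SEV3 whether or not a workaround exists).

```python
from dataclasses import dataclass

@dataclass
class Impact:
    affected_users_pct: float      # 0-100, share of users affected
    complete_outage: bool = False  # customer-facing service fully unavailable
    data_loss: bool = False        # data loss or corruption affecting users
    security_breach: bool = False  # customer data exposure

def classify_severity(impact: Impact) -> str:
    """Map impact signals to SEV1-SEV4 using the thresholds in this section."""
    if impact.complete_outage or impact.data_loss or impact.security_breach:
        return "SEV1"
    if impact.affected_users_pct > 25:
        return "SEV2"
    if impact.affected_users_pct > 0:
        return "SEV3"  # assumption: includes the exactly-25% boundary
    return "SEV4"

# Update intervals in minutes from the "Communication Frequency" lines.
# None means no fixed interval (milestones only, or standard cycle updates).
UPDATE_INTERVAL_MINUTES = {"SEV1": 15, "SEV2": 30, "SEV3": None, "SEV4": None}

severity = classify_severity(Impact(affected_users_pct=80))
interval = UPDATE_INTERVAL_MINUTES[severity]
```

Checking the SEV1 conditions first means a small but serious incident, such as a breach touching a handful of accounts, is never downgraded because of its low user percentage.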

Incident Commander Role

Primary Responsibilities

1. Command and Control

- Own the incident response process
- Make critical decisions about resource allocation
- Coordinate between technical teams and stakeholders
- Maintain situational awareness across all response streams

2. Communication Hub

- Provide regular updates to stakeholders
- Manage external communications (status pages, customer notifications)
- Facilitate effective communication between response teams
- Shield responders from external distractions

3. Process Management

- Ensure proper incident tracking and documentation
- Drive toward resolution while maintaining quality
- Coordinate handoffs between team members
- Plan and execute rollback strategies if needed

4. Post-Incident Leadership

- Ensure thorough post-incident reviews are conducted
- Drive implementation of preventive measures
- Share learnings with broader organization

Decision-Making Framework

Emergency Decisions (SEV1/2):

- Incident Commander has full authority
- Bias toward action over analysis
- Document decisions for later review
- Consult subject matter experts but don't get blocked

Resource Allocation:

- Can pull in any necessary team members
- Has authority to escalate to senior leadership
- Can approve emergency spend for external resources
- Makes the call on communication channels and timing

Technical Decisions:

- Lean on technical leads for implementation details
- Make final calls on trade-offs between speed and risk
- Approve rollback vs. fix-forward strategies
- Coordinate testing and validation approaches

Communication Templates

Initial Incident Notification (SEV1/2)

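Subject: [SEV{severity}] {Service Name} - {Brief Description}

Incident Details:

- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {user impact description}
- Current Status: {investigating/mitigating/resolved}

Technical Details:

- Affected Services: {service list}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause, if known}

Response Team:

- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}

Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}

{Incident Commander Name}
{Contact Information}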

Executive Summary (SEV1)

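Subject: URGENT - Customer-Impacting Outage - {Service Name}

Executive Summary:
{2-3 sentences describing customer impact and business impact}

Key Metrics:

- Time to Detection: {X minutes}
- Time to Engagement: {X minutes}
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}

Leadership Actions Needed:

- [ ] Customer communication approval
- [ ] PR/Communications coordination
- [ ] Resource allocation decisions
- [ ] External vendor engagement

Incident Commander: {name} ({contact})
Next Update: {time}

This is an automated alert from our incident response system.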

Customer Communication Template

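We are currently experiencing {brief description of the issue} affecting {scope of impact}.

Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until resolved.

What we know:

- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}

What we're doing:

- {primary response action}
- {secondary response action}

Workaround (if available):
{workaround steps or "No workaround is currently available"}

We apologize for the inconvenience and will share more information as it becomes available.

Next Update: {time}
Status Page: {link}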
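Before a notification is sent, a template's `{placeholder}` fields are filled in from incident data. The sketch below shows one way to do that with Python's built-in `str.format_map`. It uses a shortened, hypothetical template with illustrative field names and a made-up service name, and it is not how the skill's own scripts render templates.

```python
# Shortened, hypothetical notification template; field names are illustrative.
TEMPLATE = (
    "Subject: [SEV{severity}] {service} - {summary}\n"
    "Start Time: {start_time}\n"
    "Current Status: {status}\n"
    "Next Update: {next_update}"
)

class MissingField(dict):
    """Leave unknown placeholders visible instead of raising KeyError."""
    def __missing__(self, key):
        return "{" + key + "}"

def render(template: str, fields: dict) -> str:
    """Fill the template, keeping any unfilled placeholder as literal text."""
    return template.format_map(MissingField(fields))

message = render(TEMPLATE, {
    "severity": 1,
    "service": "checkout-api",
    "summary": "Elevated 500 errors",
    "start_time": "2026-02-01 10:00 UTC",
    "status": "Investigating",
})
```

Leaving an unfilled field such as `{next_update}` visible in the output, rather than failing, makes a missing value easy to spot during review without blocking an urgent message.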

Stakeholder Management

Stakeholder Classification

Internal Stakeholders:

- Engineering Leadership: Technical decisions and resource allocation
- Product Management: Customer impact assessment and feature implications
- Customer Support: User communication and support ticket management
- Sales/Account Management: Customer relationship management for enterprise clients
- Executive Team: Business impact decisions and external communication approval
- Legal/Compliance: Regulatory reporting and liability assessment

External Stakeholders:

- Customers: Service availability and impact communication
- Partners: API availability and integration impacts
- Vendors: Third-party service dependencies and support escalation
- Regulators: Compliance reporting for regulated industries
- Public/Media: Transparency for public-facing outages

Communication Cadence by Stakeholder

| Stakeholder | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Engineering Leadership | Real-time | 30min | 4hrs | Daily |
| Executive Team | 15min | 1hr | EOD | Weekly |
| Customer Support | Real-time | 30min | 2hrs | As needed |
| Customers | 15min | 1hr | Optional | None |
| Partners | 30min | 2hrs | Optional | None |
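The cadence table lends itself to a simple lookup that tells the Incident Commander when the next update to a given audience is due. The following is a rough sketch under stated assumptions: the `CADENCE` mapping and `next_update_due` function are hypothetical names, and entries without a fixed interval (Real-time, EOD, Daily, Weekly, Optional, As needed, None) are represented as `None` rather than modeled.

```python
from datetime import datetime, timedelta

M = lambda n: timedelta(minutes=n)  # shorthand for minute intervals
H = lambda n: timedelta(hours=n)    # shorthand for hour intervals

# Fixed intervals from the cadence table; None marks entries with no fixed interval.
CADENCE = {
    "Engineering Leadership": {"SEV1": None, "SEV2": M(30), "SEV3": H(4), "SEV4": None},
    "Executive Team": {"SEV1": M(15), "SEV2": H(1), "SEV3": None, "SEV4": None},
    "Customer Support": {"SEV1": None, "SEV2": M(30), "SEV3": H(2), "SEV4": None},
    "Customers": {"SEV1": M(15), "SEV2": H(1), "SEV3": None, "SEV4": None},
    "Partners": {"SEV1": M(30), "SEV2": H(2), "SEV3": None, "SEV4": None},
}

def next_update_due(stakeholder: str, severity: str, last_update: datetime):
    """Return when the next update is due, or None if no fixed interval applies."""
    interval = CADENCE[stakeholder][severity]
    return None if interval is None else last_update + interval

last = datetime(2026, 2, 1, 10, 0)
customers_due = next_update_due("Customers", "SEV1", last)
partners_due = next_update_due("Partners", "SEV3", last)
```

With a last update at 10:00, customers are due another SEV1 update at 10:15, while a SEV3 update to partners has no fixed deadline because the table marks it optional.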

Runbook Generation Framework

Dynamic Runbook Components

1. Detection Playbooks

- Monitoring alert definitions
- Triage decision trees
- Escalation trigger points
- Initial response actions

2. Response Playbooks

- Step-by-step mitigation procedures
- Rollback instructions
- Validation checkpoints
- Communication checkpoints

3. Recovery Playbooks

- Service restoration procedures
- Data consistency checks
- Performance validation
- User notification processes
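One way to hold the three playbook types together is a small data structure that can be rendered to text. This is only a sketch of the idea: the `Playbook` and `Runbook` classes, their fields, and the sample steps are hypothetical and do not reflect the skill's actual runbook format.

```python
from dataclasses import dataclass, field

@dataclass
class Playbook:
    name: str
    steps: list = field(default_factory=list)

@dataclass
class Runbook:
    title: str
    detection: Playbook
    response: Playbook
    recovery: Playbook

    def render(self) -> str:
        """Render the runbook as plain text with numbered steps per playbook."""
        lines = [self.title]
        for playbook in (self.detection, self.response, self.recovery):
            lines.append("")
            lines.append(playbook.name)
            lines.extend(f"{i}. {step}" for i, step in enumerate(playbook.steps, start=1))
        return "\n".join(lines)

# Hypothetical runbook for a database incident.
runbook = Runbook(
    title="Database connection exhaustion",
    detection=Playbook("Detection", ["Confirm connection-pool alert", "Follow triage decision tree"]),
    response=Playbook("Response", ["Roll back latest deploy", "Validate error rate has dropped"]),
    recovery=Playbook("Recovery", ["Run data consistency checks", "Notify users of restoration"]),
)
text = runbook.render()
```

Keeping detection, response, and recovery as separate objects makes it straightforward to update one phase after a post-incident review without touching the others.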

Runbook Template Structure

CODEBLOCK3

Usage Examples

Example 1: Database Connectivity Incident

```bash
# Classify the incident
echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py

# Reconstruct timeline from logs
python scripts/timeline_reconstructor.py --input assets/db_incident_events.json --output timeline.md

# Generate PIR after resolution
python scripts/pir_generator.py --incident assets/db_incident_data.json --timeline timeline.md --output pir.md
```

Example 2: API Rate Limiting Incident

```bash
# Quick classification from stdin
echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text

# Build timeline from multiple sources
python scripts/timeline_reconstructor.py --input assets/api_incident_logs.json --detect-phases --gap-analysis

# Generate comprehensive PIR
python scripts/pir_generator.py --incident assets/api_incident_summary.json --rca-method fishbone --action-items
```

Best Practices

During Incident Response

1. Maintain Calm Leadership

- Stay composed under pressure
- Make decisive calls with incomplete information
- Communicate confidence while acknowledging uncertainty

2. Document Everything

- All actions taken and their outcomes
- Decision rationale, especially for controversial calls
- Timeline of events as they happen

3. Effective Communication

- Use clear, jargon-free language
- Provide regular updates even when there's no new information
- Manage stakeholder expectations proactively

4. Technical Excellence

- Prefer rollbacks to risky fixes under pressure
- Validate fixes before declaring resolution
- Plan for secondary failures and cascading effects

Post-Incident

1. Blameless Culture

- Focus on system failures, not individual mistakes
- Encourage honest reporting of what went wrong
- Celebrate learning and improvement opportunities

2. Action Item Discipline

- Assign specific owners and due dates
- Track progress publicly
- Prioritize based on risk and effort

3. Knowledge Sharing

- Share PIRs broadly within the organization
- Update runbooks based on lessons learned
- Conduct training sessions for common failure modes

4. Continuous Improvement

- Look for patterns across multiple incidents
- Invest in tooling and automation
- Regularly review and update processes
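The action item guidance above (specific owners, due dates, and prioritization by risk and effort) can be made concrete with a small ranking helper. This is a sketch under assumptions: the 1-5 scales, the risk-per-effort score, the tie-break on due date, and the sample items are all illustrative choices rather than part of the skill.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    risk: int    # 1 (low) to 5 (high): risk reduced by completing the item
    effort: int  # 1 (small) to 5 (large)

def prioritize(items):
    """Order items by risk reduced per unit of effort, highest first.

    Ties are broken by the earlier due date.
    """
    return sorted(items, key=lambda item: (-item.risk / item.effort, item.due))

# Hypothetical follow-up items from a post-incident review.
items = [
    ActionItem("Add connection-pool alert", "alice", date(2026, 3, 1), risk=4, effort=1),
    ActionItem("Rewrite failover logic", "bob", date(2026, 4, 1), risk=5, effort=5),
    ActionItem("Update rollback runbook", "carol", date(2026, 2, 20), risk=3, effort=1),
]
ranked = [item.title for item in prioritize(items)]
```

Under this scoring the two cheap, high-value items rank ahead of the large rewrite, even though the rewrite addresses the highest raw risk. Teams that prefer to tackle the biggest risk first would sort on `risk` alone.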

Integration with Existing Tools

Monitoring and Alerting

- PagerDuty/Opsgenie integration for escalation
- Datadog/Grafana for metrics and dashboards
- ELK/Splunk for log analysis and correlation

Communication Platforms

- Slack/Teams for war room coordination
- Zoom/Meet for video bridges
- Status page providers (Statuspage.io, etc.)

Documentation Systems

- Confluence/Notion for PIR storage
- GitHub/GitLab for runbook version control
- JIRA/Linear for action item tracking

Change Management

- CI/CD pipeline integration
- Deployment tracking systems
- Feature flag platforms for quick rollbacks

Conclusion

The Incident Commander skill provides a comprehensive framework for managing incidents from detection through post-incident review. By implementing structured processes, clear communication templates, and thorough analysis tools, teams can improve their incident response capabilities and build more resilient systems.

The key to successful incident management is preparation, practice, and continuous learning. Use this framework as a starting point, but adapt it to your organization's specific needs, culture, and technical environment.

Remember: The goal isn't to prevent all incidents (which is impossible), but to detect them quickly, respond effectively, communicate clearly, and learn continuously.
