Incident Commander Skill
Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026
Overview
The Incident Commander skill provides an incident response framework for managing technology incidents from detection through resolution and post-incident review. It draws on practices used by SRE and DevOps teams operating at scale and provides structured tools for severity classification, timeline reconstruction, and post-incident analysis.
Key Features
- Automated Severity Classification - Incident triage based on impact and urgency metrics
- Timeline Reconstruction - Transform scattered logs and events into coherent incident narratives
- Post-Incident Review Generation - Structured Post-Incident Reviews (PIRs) with multiple root cause analysis (RCA) frameworks
- Communication Templates - Pre-built templates for stakeholder updates and escalations
- Runbook Integration - Generate actionable runbooks from incident patterns
Skills Included
Core Tools
1. Incident Classifier (incident_classifier.py)
   - Analyzes incident descriptions and outputs severity levels
   - Recommends response teams and initial actions
   - Generates communication templates based on severity
2. Timeline Reconstructor (timeline_reconstructor.py)
   - Processes timestamped events from multiple sources
   - Reconstructs chronological incident timeline
   - Identifies gaps and provides duration analysis
3. PIR Generator (pir_generator.py)
   - Creates comprehensive Post-Incident Review documents
   - Applies multiple RCA frameworks (5 Whys, Fishbone, Timeline)
   - Generates actionable follow-up items
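As a rough illustration of what the Timeline Reconstructor described above does, the sketch below merges timestamped events from several sources, sorts them chronologically, and flags gaps. It is not the implementation of timeline_reconstructor.py: the event fields (`timestamp`, `source`, `message`), the function names, the sample events, and the 10-minute gap threshold are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def build_timeline(*sources):
    """Merge event lists from several sources into one chronological list."""
    events = [event for source in sources for event in source]
    # Parse the ISO 8601 strings so that sorting is by time, not by text.
    return sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))


def find_gaps(timeline, threshold=timedelta(minutes=10)):
    """Return (start, end, duration) for consecutive events further apart than threshold."""
    gaps = []
    for earlier, later in zip(timeline, timeline[1:]):
        start = datetime.fromisoformat(earlier["timestamp"])
        end = datetime.fromisoformat(later["timestamp"])
        if end - start > threshold:
            gaps.append((start, end, end - start))
    return gaps


# Hypothetical sample events from two sources.
alerts = [
    {"timestamp": "2026-02-01T10:00:00", "source": "pagerduty", "message": "High error rate alert"},
]
chat = [
    {"timestamp": "2026-02-01T10:04:00", "source": "slack", "message": "On-call acknowledged"},
    {"timestamp": "2026-02-01T10:40:00", "source": "slack", "message": "Rollback complete"},
]

timeline = build_timeline(chat, alerts)
gaps = find_gaps(timeline)

# Duration analysis: time between the first and last recorded event.
total_duration = (
    datetime.fromisoformat(timeline[-1]["timestamp"])
    - datetime.fromisoformat(timeline[0]["timestamp"])
)
```

A gap in the merged timeline usually means either that nothing happened or that something happened and was not recorded; the second case is worth chasing during the post-incident review.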
Incident Response Framework
Severity Classification System
SEV1 - Critical Outage
Definition: Complete service failure affecting all users or critical business functions
Characteristics:
- Customer-facing services completely unavailable
- Data loss or corruption affecting users
- Security breaches with customer data exposure
- Revenue-generating systems down
- SLA violations with financial penalties
Response Requirements:
- Immediate escalation to on-call engineer
- Incident Commander assigned within 5 minutes
- Executive notification within 15 minutes
- Public status page update within 15 minutes
- War room established
- All hands on deck if needed
Communication Frequency: Every 15 minutes until resolution
SEV2 - Major Impact
Definition: Significant degradation affecting a subset of users or non-critical functions
Characteristics:
- Partial service degradation (>25% of users affected)
- Performance issues causing user frustration
- Non-critical features unavailable
- Internal tool failures impacting productivity
- Data inconsistencies not affecting user experience
Response Requirements:
- On-call engineer response within 15 minutes
- Incident Commander assigned within 30 minutes
- Status page update within 30 minutes
- Stakeholder notification within 1 hour
- Regular team updates
Communication Frequency: Every 30 minutes during active response
SEV3 - Minor Impact
Definition: Limited impact with workarounds available
Characteristics:
- Single feature or component affected
- <25% of users impacted
- Workarounds available
- Performance degradation not significantly impacting UX
- Non-urgent monitoring alerts
Response Requirements:
- Response within 2 hours during business hours
- Next business day response acceptable outside hours
- Internal team notification
- Optional status page update
Communication Frequency: At key milestones only
SEV4 - Low Impact
Definition: Minimal impact, cosmetic issues, or planned maintenance
Characteristics:
- Cosmetic bugs
- Documentation issues
- Logging or monitoring gaps
- Performance issues with no user impact
- Development/test environment issues
Response Requirements:
- Response within 1-2 business days
- Standard ticket/issue tracking
- No special escalation required
Communication Frequency: Standard development cycle updates
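The severity definitions above can be expressed as a small decision function. The sketch below is one possible encoding and not the logic of incident_classifier.py: the `Impact` fields and the `classify` function are hypothetical names, and because the definitions leave the case of exactly 25% of users unspecified, this sketch assumes it falls under SEV3.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical description of an incident's impact."""
    users_affected_pct: float            # share of users affected, 0-100
    service_unavailable: bool = False    # customer-facing service completely down
    data_loss: bool = False              # data loss or corruption affecting users
    customer_data_exposed: bool = False  # security breach exposing customer data
    revenue_system_down: bool = False    # revenue-generating system down


def classify(impact: Impact) -> str:
    """Map an impact description to a severity level using the definitions above."""
    if (impact.service_unavailable or impact.data_loss
            or impact.customer_data_exposed or impact.revenue_system_down):
        return "SEV1"
    if impact.users_affected_pct > 25:
        return "SEV2"
    if impact.users_affected_pct > 0:
        # Assumption: exactly 25% is treated as SEV3.
        return "SEV3"
    return "SEV4"


# Minutes allowed before an Incident Commander must be assigned,
# taken from the SEV1 and SEV2 response requirements.
IC_ASSIGNMENT_MINUTES = {"SEV1": 5, "SEV2": 30}

severity = classify(Impact(users_affected_pct=40))
ic_deadline = IC_ASSIGNMENT_MINUTES.get(severity)
```

Checking the SEV1 conditions first means a breach or data loss is never downgraded because few users are affected.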
Incident Commander Role
Primary Responsibilities
1. Command and Control
   - Own the incident response process
   - Make critical decisions about resource allocation
   - Coordinate between technical teams and stakeholders
   - Maintain situational awareness across all response streams
2. Communication Hub
   - Provide regular updates to stakeholders
   - Manage external communications (status pages, customer notifications)
   - Facilitate effective communication between response teams
   - Shield responders from external distractions
3. Process Management
   - Ensure proper incident tracking and documentation
   - Drive toward resolution while maintaining quality
   - Coordinate handoffs between team members
   - Plan and execute rollback strategies if needed
4. Post-Incident Leadership
   - Ensure thorough post-incident reviews are conducted
   - Drive implementation of preventive measures
   - Share learnings with broader organization
Decision-Making Framework
Emergency Decisions (SEV1/2):
- Incident Commander has full authority
- Bias toward action over analysis
- Document decisions for later review
- Consult subject matter experts but don't get blocked
Resource Allocation:
- Can pull in any necessary team members
- Authority to escalate to senior leadership
- Can approve emergency spend for external resources
- Makes the call on communication channels and timing
Technical Decisions:
- Lean on technical leads for implementation details
- Make final calls on trade-offs between speed and risk
- Approve rollback vs. fix-forward strategies
- Coordinate testing and validation approaches
Communication Templates
Initial Incident Notification (SEV1/2)
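Subject: [SEV{severity}] {service name} - {brief description}
Incident Details:
- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {description of user impact}
- Current Status: {investigating/mitigating/resolved}
Technical Details:
- Affected Services: {list of services}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause, if known}
Response Team:
- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}
Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}
{Incident Commander name}
{contact information}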
Executive Summary (SEV1)
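Subject: URGENT - Customer-Impacting Outage - {service name}
Executive Summary:
{2-3 sentences describing customer and business impact}
Key Metrics:
- Time to Detection: {X minutes}
- Time to Engagement: {X minutes}
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}
Leadership Actions Required:
- [ ] Customer communication approval
- [ ] PR/communications coordination
- [ ] Resource allocation decisions
- [ ] External vendor engagement
Incident Commander: {name} ({contact})
Next Update: {time}
This is an automated alert from our incident response system.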
Customer Communication Template
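We are currently experiencing {brief description of the issue} affecting {scope of impact}.
Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until it is resolved.
What we know:
- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}
What we're doing:
Workaround (if available):
{workaround steps, or "No workaround is currently available"}
We apologize for the inconvenience and will share more information as it becomes available.
Next Update: {time}
Status Page: {link}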
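A notification template can be filled in programmatically so that no field is forgotten under pressure. The sketch below uses Python's built-in `str.format`; the shortened template text, the field names, and the `render_notification` helper are illustrative assumptions and not the output of incident_classifier.py.

```python
# Hypothetical, shortened notification template for illustration only.
SUBJECT_TEMPLATE = "[{severity}] {service} - {summary}"
BODY_TEMPLATE = (
    "Start time: {start_time}\n"
    "Impact: {impact}\n"
    "Current status: {status}\n"
    "Next update: {next_update}"
)


def render_notification(fields):
    """Fill the templates, failing loudly if a required field is missing."""
    try:
        subject = SUBJECT_TEMPLATE.format(**fields)
        body = BODY_TEMPLATE.format(**fields)
    except KeyError as missing:
        raise ValueError(f"missing template field: {missing}") from None
    return f"Subject: {subject}\n\n{body}"


# Hypothetical incident data.
example_fields = {
    "severity": "SEV1",
    "service": "checkout-api",
    "summary": "Elevated 500 errors",
    "start_time": "2026-02-01 10:00 UTC",
    "impact": "Most checkout requests are failing",
    "status": "Investigating",
    "next_update": "2026-02-01 10:15 UTC",
}
message = render_notification(example_fields)
```

Raising an error on a missing field, rather than sending a message with a blank, keeps incomplete notifications from reaching stakeholders.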
Stakeholder Management
Stakeholder Classification
Internal Stakeholders:
- Engineering Leadership - Technical decisions and resource allocation
- Product Management - Customer impact assessment and feature implications
- Customer Support - User communication and support ticket management
- Sales/Account Management - Customer relationship management for enterprise clients
- Executive Team - Business impact decisions and external communication approval
- Legal/Compliance - Regulatory reporting and liability assessment
External Stakeholders:
- Customers - Service availability and impact communication
- Partners - API availability and integration impacts
- Vendors - Third-party service dependencies and support escalation
- Regulators - Compliance reporting for regulated industries
- Public/Media - Transparency for public-facing outages
Communication Cadence by Stakeholder
| Stakeholder | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Engineering Leadership | Real-time | 30 min | 4 hrs | Daily |
| Executive Team | 15 min | 1 hr | EOD | Weekly |
| Customer Support | Real-time | 30 min | 2 hrs | As needed |
| Customers | 15 min | 1 hr | Optional | None |
| Partners | 30 min | 2 hrs | Optional | None |
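The cadence table above lends itself to a lookup that tells the Incident Commander when the next update is due. This is a minimal sketch: the `CADENCE_MINUTES` mapping and `next_update_due` function are hypothetical names, and only cells with a fixed interval are encoded. Entries such as Real-time, EOD, Daily, Weekly, Optional, and As needed are treated as having no fixed interval.

```python
from datetime import datetime, timedelta

# Update intervals in minutes, copied from the cadence table.
# A missing entry means there is no fixed interval for that combination.
CADENCE_MINUTES = {
    "Engineering Leadership": {"SEV2": 30, "SEV3": 240},
    "Executive Team": {"SEV1": 15, "SEV2": 60},
    "Customer Support": {"SEV2": 30, "SEV3": 120},
    "Customers": {"SEV1": 15, "SEV2": 60},
    "Partners": {"SEV1": 30, "SEV2": 120},
}


def next_update_due(stakeholder, severity, last_update):
    """Return when the next update is due, or None if no fixed interval applies."""
    minutes = CADENCE_MINUTES.get(stakeholder, {}).get(severity)
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)


last_sent = datetime(2026, 2, 1, 10, 0)
customers_due = next_update_due("Customers", "SEV1", last_sent)
leadership_due = next_update_due("Engineering Leadership", "SEV1", last_sent)
```

Here `customers_due` is 15 minutes after `last_sent`, while `leadership_due` is `None` because Engineering Leadership is updated in real time during a SEV1.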
Runbook Generation Framework
Dynamic Runbook Components
1. Detection Playbooks
   - Monitoring alert definitions
   - Triage decision trees
   - Escalation trigger points
   - Initial response actions
2. Response Playbooks
   - Step-by-step mitigation procedures
   - Rollback instructions
   - Validation checkpoints
   - Communication checkpoints
3. Recovery Playbooks
   - Service restoration procedures
   - Data consistency checks
   - Performance validation
   - User notification processes
Runbook Template Structure
CODEBLOCK3
```bash
# Classify the incident
echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py

# Reconstruct timeline from logs
python scripts/timeline_reconstructor.py --input assets/db_incident_events.json --output timeline.md

# Generate PIR after resolution
python scripts/pir_generator.py --incident assets/db_incident_data.json --timeline timeline.md --output pir.md
```
### Example 2: API Rate Limiting Incident
```bash
# Quick classification from stdin
echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text

# Build timeline from multiple sources
python scripts/timeline_reconstructor.py --input assets/api_incident_logs.json --detect-phases --gap-analysis

# Generate comprehensive PIR
python scripts/pir_generator.py --incident assets/api_incident_summary.json --rca-method fishbone --action-items
```
Best Practices
During Incident Response
1. Maintain Calm Leadership
   - Stay composed under pressure
   - Make decisive calls with incomplete information
   - Communicate confidence while acknowledging uncertainty
2. Document Everything
   - All actions taken and their outcomes
   - Decision rationale, especially for controversial calls
   - Timeline of events as they happen
3. Effective Communication
   - Use clear, jargon-free language
   - Provide regular updates even when there's no new information
   - Manage stakeholder expectations proactively
4. Technical Excellence
   - Prefer rollbacks to risky fixes under pressure
   - Validate fixes before declaring resolution
   - Plan for secondary failures and cascading effects
Post-Incident
1. Blameless Culture
   - Focus on system failures, not individual mistakes
   - Encourage honest reporting of what went wrong
   - Celebrate learning and improvement opportunities
2. Action Item Discipline
   - Assign specific owners and due dates
   - Track progress publicly
   - Prioritize based on risk and effort
3. Knowledge Sharing
   - Share PIRs broadly within the organization
   - Update runbooks based on lessons learned
   - Conduct training sessions for common failure modes
4. Continuous Improvement
   - Look for patterns across multiple incidents
   - Invest in tooling and automation
   - Regularly review and update processes
Integration with Existing Tools
Monitoring and Alerting
- PagerDuty/Opsgenie integration for escalation
- Datadog/Grafana for metrics and dashboards
- ELK/Splunk for log analysis and correlation
Communication Platforms
- Slack/Teams for war room coordination
- Zoom/Meet for video bridges
- Status page providers (Statuspage.io, etc.)
Documentation Systems
- Confluence/Notion for PIR storage
- GitHub/GitLab for runbook version control
- JIRA/Linear for action item tracking
Change Management
- CI/CD pipeline integration
- Deployment tracking systems
- Feature flag platforms for quick rollbacks
Conclusion
The Incident Commander skill provides a comprehensive framework for managing incidents from detection through post-incident review. By implementing structured processes, clear communication templates, and thorough analysis tools, teams can improve their incident response capabilities and build more resilient systems.
The key to successful incident management is preparation, practice, and continuous learning. Use this framework as a starting point, but adapt it to your organization's specific needs, culture, and technical environment.
Remember: The goal isn't to prevent all incidents (which is impossible), but to detect them quickly, respond effectively, communicate clearly, and learn continuously.