Canary Deploy
Safe system changes with pre-flight checks, validation, and automatic rollback.
The Problem
System changes can lock you out:
- SSH hardening breaks remote access
- Firewall rules block needed ports
- Kernel parameters cause instability
- Service restarts break dependencies
Recovery without physical access is painful or impossible.
Quick Start
Before any critical change
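```bash
# Capture baseline (connectivity, services, ports)
bash scripts/canary-test.sh baseline

# Make the change
sudo nano /etc/ssh/sshd_config

# Validate that the change broke nothing
bash scripts/canary-test.sh validate

# If validation fails:
bash scripts/canary-test.sh rollback
```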
For automated changes
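```bash
# Full pipeline: baseline → apply → validate → rollback on failure
bash scripts/critical-update.sh \
  --name "SSH Hardening" \
  --backup /etc/ssh/sshd_config \
  --command "sudo sed -i 's/PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config && sudo systemctl reload sshd" \
  --validate "ssh -o ConnectTimeout=5 localhost echo ok"
```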
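To make the automated flow concrete, here is a minimal sketch of a back up, apply, validate, and restore-on-failure wrapper. The function name `guarded_change` and its three arguments are hypothetical and are not taken from the project's scripts; it only restores the file and assumes any service reload is part of the apply or rollback command you supply.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: back up a file, apply a change, validate it,
# and restore the backup if either step fails.
# Usage: guarded_change TARGET_FILE APPLY_CMD VALIDATE_CMD

guarded_change() {
    local target="$1"        # file to back up before the change
    local apply_cmd="$2"     # command string that makes the change
    local validate_cmd="$3"  # command string that must succeed afterwards
    local backup

    backup="$(mktemp)" || return 2

    # Back up first; refuse to continue if the copy fails.
    if ! cp -p -- "$target" "$backup"; then
        rm -f -- "$backup"
        return 2
    fi

    # Apply, then validate. A failure in either step triggers a restore.
    if bash -c "$apply_cmd" && bash -c "$validate_cmd"; then
        rm -f -- "$backup"
        echo "OK: change applied and validated"
        return 0
    fi

    cp -p -- "$backup" "$target"
    rm -f -- "$backup"
    echo "FAIL: change rolled back" >&2
    return 1
}
```

Return code 2 means the backup itself failed and nothing was changed, which keeps "could not start" distinct from "started and rolled back".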
Protocol A+B (Manual Workflow)
For interactive sessions where you want human-in-the-loop:
Protocol A: Test interactively
1. Tell the human: "Open a second SSH session as backup"
2. Apply the change in the first session
3. Ask: "Test connectivity from the second session"
4. If it works → confirm
5. If it fails → roll back from the backup session
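The connectivity test in Protocol A is most meaningful when it opens a fresh connection, since an already established session usually survives an sshd reload and so proves little. A minimal sketch of a retrying probe follows; `probe_with_retries` is an illustrative name, and `user@host` in the usage comment is a placeholder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a probe command up to ATTEMPTS times.
# Usage: probe_with_retries ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]

probe_with_retries() {
    local attempts="$1" delay="$2"
    shift 2
    local i

    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            echo "reachable (attempt $i)"
            return 0
        fi
        # Wait between attempts, but not after the last one.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done

    echo "unreachable after $attempts attempts" >&2
    return 1
}

# Example, run from the backup session (user@host is a placeholder):
#   probe_with_retries 3 2 ssh -o BatchMode=yes -o ConnectTimeout=5 user@host true
```

`BatchMode=yes` makes the probe fail fast instead of waiting at a password prompt, so a non-zero exit is a clear signal to roll back from the session that is still open.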
Protocol B: Backup first
1. Run `bash scripts/canary-test.sh baseline`
2. Verify the backup is valid
3. Apply the change
4. Run `bash scripts/canary-test.sh validate`
5. If validation fails → `bash scripts/canary-test.sh rollback`
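The "verify backup is valid" step in Protocol B can be as simple as a byte-for-byte comparison right after the copy. The sketch below assumes a per-file backup layout (`<backup_dir>/<name>.bak`) that is purely illustrative; the project's scripts may store backups differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy a file into a backup directory and confirm
# the copy is byte-identical before any change is made.
# Usage: backup_and_verify SOURCE_FILE BACKUP_DIR

backup_and_verify() {
    local src="$1" backup_dir="$2"
    local dest="$backup_dir/${src##*/}.bak"   # strip directory part of SOURCE_FILE

    mkdir -p -- "$backup_dir" || return 1
    cp -p -- "$src" "$dest" || return 1

    # cmp -s is silent and exits 0 only when both files are identical.
    if cmp -s "$src" "$dest"; then
        echo "$dest"
        return 0
    fi

    echo "backup verification failed for $src" >&2
    return 1
}

# Example:
#   backup_and_verify /etc/ssh/sshd_config /var/backups/canary
```

Printing the backup path on success lets a caller record exactly which file to restore from.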
Always use A and B together for maximum safety.
What Gets Checked
Baseline capture
- SSH connectivity (local + remote)
- Open ports (ss -tlnp)
- Running services (systemctl)
- Firewall rules (ufw/iptables)
- Network routes
- DNS resolution
- Config file checksums
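As an illustration of two of the items above, the following sketch records listening TCP sockets and config file checksums into a baseline directory. The function name and the output file names (`ports.txt`, `checksums.txt`) are assumptions, and it relies on `ss` and `sha256sum` as found on typical Linux systems.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: capture part of a baseline (ports + checksums).
# Usage: capture_baseline OUT_DIR CONFIG_FILE [CONFIG_FILE...]

capture_baseline() {
    local out_dir="$1"
    shift
    mkdir -p -- "$out_dir" || return 1

    # Listening TCP sockets. Without root, -p shows limited process info.
    # Skipped quietly if ss is not installed.
    if command -v ss >/dev/null 2>&1; then
        ss -tlnp > "$out_dir/ports.txt" 2>/dev/null || true
    fi

    # One checksum line per readable config file; unreadable paths are skipped.
    : > "$out_dir/checksums.txt"
    local f
    for f in "$@"; do
        if [ -r "$f" ]; then
            sha256sum -- "$f" >> "$out_dir/checksums.txt"
        fi
    done
    return 0
}

# Example:
#   capture_baseline /var/tmp/canary-baseline /etc/ssh/sshd_config /etc/sysctl.conf
```

Because the checksum file is in `sha256sum` format, it can later be checked with `sha256sum -c` to see which files changed.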
Validation
- All baseline checks re-run
- Diff against baseline
- Any regression = FAIL
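The diff step can be sketched as a strict comparison of one captured file against its baseline counterpart. This is an assumption about how the comparison might work, not the project's implementation: it treats any difference as a failure, so a deliberately edited file would need its baseline entry refreshed or excluded.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: compare a current capture with its baseline.
# Prints PASS or FAIL on stdout; the unified diff goes to stderr.
# Usage: validate_against_baseline BASELINE_FILE CURRENT_FILE

validate_against_baseline() {
    local baseline="$1" current="$2"

    # diff exits 0 when the files match, 1 when they differ,
    # and 2 on trouble (for example a missing file). Only 0 passes.
    if diff -u "$baseline" "$current" >&2; then
        echo "PASS"
        return 0
    fi

    echo "FAIL"
    return 1
}

# Example:
#   validate_against_baseline baseline/ports.txt current/ports.txt
```

Keeping the verdict on stdout and the diff on stderr makes the result easy to consume in a script while still showing the operator what changed.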
Critical Change Categories
| Category | Risk | Example | Recovery |
|---|---|---|---|
| SSH config | 🔴 HIGH | sshd_config changes | Backup session |
| Firewall | 🔴 HIGH | UFW/iptables rules | Pre-change snapshot |
| Network | 🔴 HIGH | Interface/routing changes | Console access |
| Services | 🟡 MEDIUM | systemd unit changes | systemctl restart |
| Kernel params | 🟡 MEDIUM | sysctl changes | Reboot (clears runtime-only changes) |
| Packages | 🟢 LOW | apt install/upgrade | Downgrade to the previous version (`apt install pkg=version`) |
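A wrapper script could use these categories to decide how much ceremony a change needs. The sketch below maps a file path to a risk level; the path patterns are assumptions based on common Debian/Ubuntu locations, and `risk_for_path` is an illustrative name.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: look up the risk level for a file about to change.
# Usage: risk_for_path /path/to/file

risk_for_path() {
    case "$1" in
        /etc/ssh/*)                        echo "HIGH" ;;     # SSH config
        /etc/ufw/*|/etc/iptables/*)        echo "HIGH" ;;     # Firewall
        /etc/netplan/*|/etc/network/*)     echo "HIGH" ;;     # Network
        /etc/systemd/system/*)             echo "MEDIUM" ;;   # Services
        /etc/sysctl.conf|/etc/sysctl.d/*)  echo "MEDIUM" ;;   # Kernel params
        *)                                 echo "UNKNOWN" ;;  # not in the table
    esac
}

# Example:
#   risk_for_path /etc/ssh/sshd_config
```

Returning `UNKNOWN` rather than a low rating for unlisted paths avoids treating an unclassified file as safe by default.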
References
See references/incident-report.md for the real incident that inspired this skill.