Failover Gateway for OpenClaw
Deploy a standby OpenClaw gateway that automatically takes over when your primary goes down. Active-passive design with auto-promotion and auto-demotion.
What You Get
- ~30 second failure detection: the health monitor detects that the primary is down and promotes the standby (reachable again in about 60 seconds)
- Auto-recovery: when the primary comes back, the standby demotes itself
- Zero split-brain: primary and standby use different channels (no duplicate messages)
- Git-synced workspace: the standby pulls the latest workspace on promotion
- $6-12/month: runs on a minimal VPS
Architecture
CODEBLOCK0
The key insight: split your channels between primary and standby. Don't share credentials — give each node exclusive ownership of different channels. This eliminates split-brain entirely.
Channel Split Examples
| Setup | Primary | Standby |
|---|---|---|
| Rocket.Chat + Discord | Rocket.Chat (full) | Discord DM only |
| Discord + Telegram | Discord (full) | Telegram DM only |
| Slack + Discord | Slack (full) | Discord DM only |
Your primary handles everything. The standby is minimal recovery — just enough to stay reachable.
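One way to sanity-check the split is to compare the channel names enabled on each node and flag any overlap. The sketch below assumes you can list each node's enabled channels as space-separated names; `channel_overlap` is a hypothetical helper, not part of OpenClaw.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every channel name that appears in both lists.
# Arguments are space-separated channel names, e.g. "rocketchat discord".
channel_overlap() {
  primary_channels=$1
  standby_channels=$2
  for p in $primary_channels; do
    for s in $standby_channels; do
      if [ "$p" = "$s" ]; then
        echo "$p"
      fi
    done
  done
}

# Example: the primary still has Discord enabled, which the standby also owns.
overlap=$(channel_overlap "rocketchat discord" "discord")
if [ -n "$overlap" ]; then
  echo "Channels enabled on both nodes: $overlap"
fi
```

An empty result means each node owns its channels exclusively, which is the property the design relies on.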
Prerequisites
- Primary OpenClaw instance running on a VPS
- A second VPS for the standby ($6-12/mo, any provider)
- Tailscale mesh network (or any VPN/private network)
- Git repository for workspace sync (GitHub, GitLab, etc.)
- A second messaging channel for the standby (different from primary)
Step-by-Step Deployment
Phase 1: Provision the Standby VPS
Any cheap VPS works. Recommended: 2GB RAM, Ubuntu 24.04.
CODEBLOCK1
Phase 2: Install OpenClaw
CODEBLOCK2
Phase 3: Failover Config
Create a minimal OpenClaw config on the standby. Only enable the standby channel:
CODEBLOCK3
Important: Disable this channel on your primary to avoid conflicts.
Test it works: openclaw gateway run — verify the bot connects and responds, then stop it.
Phase 4: Deploy Health Monitor
Copy the included scripts/health-monitor.sh to the standby:
CODEBLOCK4
Edit the variables at the top:
- PRIMARY_IP: your primary's Tailscale IP
- PRIMARY_PORT: your primary's gateway port (default: 18789)
- SECRETS_HOST: (optional) host to rsync secrets from on promotion
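The included script is not reproduced here, so the following is only a sketch of how variables like these might drive a reachability check. It assumes a plain TCP connect to the gateway port is an adequate health signal and uses bash's `/dev/tcp` redirection together with coreutils `timeout`; `primary_is_up` and `CHECK_TIMEOUT` are hypothetical names, and the real script may probe differently.

```shell
#!/usr/bin/env bash
# Placeholder values; PRIMARY_IP and PRIMARY_PORT mirror the variables listed above.
PRIMARY_IP="100.64.0.1"   # example address in the Tailscale range, replace with yours
PRIMARY_PORT=18789        # default gateway port
CHECK_TIMEOUT=5           # hypothetical: seconds to wait for the TCP connect

# Return 0 if a TCP connection to host:port succeeds within the timeout.
primary_is_up() {
  host=$1
  port=$2
  timeout "$CHECK_TIMEOUT" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Usage (commented out so that sourcing this file makes no network calls):
# if primary_is_up "$PRIMARY_IP" "$PRIMARY_PORT"; then echo "primary up"; else echo "primary down"; fi
```

A TCP connect only proves that something is listening on the port. If the gateway exposes a richer health signal, that would be a stronger check.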
Create the systemd services:
/etc/systemd/system/openclaw-health-monitor.service
CODEBLOCK5
/etc/systemd/system/openclaw.service
CODEBLOCK6
Enable the monitor (but NOT the gateway — the monitor starts it on promotion):
CODEBLOCK7
Phase 5: Disable Standby Channel on Primary
This is critical. Remove or disable the standby's channel from your primary config:
CODEBLOCK8
Each node owns its channels exclusively. No sharing, no conflicts.
Phase 6: Test
CODEBLOCK9
Failover Timeline
| Time | Event |
|---|---|
| 0s | Primary goes down |
| 10s | First health check fails |
| 20s | Second check fails |
| 30s | Third check fails → PROMOTE |
| 35s | Git pull, secrets sync |
| 40s | Gateway starting |
| 45s | Standby channel active |
| ~60s | You're reachable again |
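The first half of the timeline is simple arithmetic: with a check every 10 seconds and promotion after three consecutive failures, the decision lands at roughly 3 × 10 = 30 seconds. A small sketch of that counting logic follows; `CHECK_INTERVAL` is an assumed name for the polling period, while `FAIL_THRESHOLD` is the variable mentioned under Edge Cases.

```shell
#!/usr/bin/env bash
CHECK_INTERVAL=10   # assumed name: seconds between health checks
FAIL_THRESHOLD=3    # consecutive failures required before promotion

# Seconds of failed checks needed before the promotion decision.
detection_seconds() {
  echo $(( $1 * $2 ))
}

# Print the 1-based index of the check that triggers promotion, or 0 if the
# history never contains enough consecutive failures.
# Usage: promotion_check_index THRESHOLD result...   (each result: ok | fail)
promotion_check_index() {
  threshold=$1
  shift
  streak=0
  index=0
  for result in "$@"; do
    index=$((index + 1))
    if [ "$result" = "fail" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any healthy check resets the count
    fi
    if [ "$streak" -ge "$threshold" ]; then
      echo "$index"
      return 0
    fi
  done
  echo 0
}

echo "Promotion decision after $(detection_seconds "$CHECK_INTERVAL" "$FAIL_THRESHOLD")s of failed checks"
```

Because a single healthy check resets the streak, an isolated timeout does not trigger promotion on its own.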
Edge Cases
| Scenario | Result |
|---|---|
| Primary dies | Standby promotes in ~30-60s |
| Primary + standby die | You're offline (add a third node?) |
| Network partition | Standby may promote while primary is still running; since they use different channels, there are no conflicts |
| Standby reboots | Health monitor auto-restarts (systemd), resumes watching |
| Primary flaps | Promote/demote cycles; the health monitor handles it, but consider increasing FAIL_THRESHOLD |
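For the flapping case, one possible mitigation besides raising `FAIL_THRESHOLD` is to require several consecutive healthy checks before demoting. Whether the included script does this is not stated; the sketch below is an illustration only, and `RECOVER_THRESHOLD` and `next_state` are hypothetical names.

```shell
#!/usr/bin/env bash
FAIL_THRESHOLD=3      # consecutive failures before promoting
RECOVER_THRESHOLD=2   # hypothetical: consecutive healthy checks before demoting

# Print the next state given the current state and the current streaks.
# Usage: next_state STATE FAIL_STREAK OK_STREAK   (STATE: standby | active)
next_state() {
  state=$1
  fail_streak=$2
  ok_streak=$3
  if [ "$state" = "standby" ] && [ "$fail_streak" -ge "$FAIL_THRESHOLD" ]; then
    echo "active"     # promote
  elif [ "$state" = "active" ] && [ "$ok_streak" -ge "$RECOVER_THRESHOLD" ]; then
    echo "standby"    # demote
  else
    echo "$state"     # no change
  fi
}

next_state standby 3 0   # enough failures: promote
next_state active 0 1    # one healthy check is not yet enough to demote
```

Asymmetric thresholds like these trade a slightly slower failback for fewer promote/demote cycles.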
Failback
Recovery is automatic. When the primary comes back:
1. The health monitor detects that the primary is healthy
2. It stops the standby gateway
3. The primary resumes its own channels
4. The standby returns to watching
No manual intervention needed.
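The promotion and failback actions can be pictured as two small functions. This is a sketch under assumptions, not the included script: it presumes the monitor runs as root, that the gateway unit is the `openclaw.service` defined in Phase 4, and that the workspace is owned by an `openclaw` user. `promote`, `demote`, `RUN_AS`, and `WORKSPACE` are illustrative names, and the optional secrets sync is left out.

```shell
#!/usr/bin/env bash
GATEWAY_UNIT="openclaw"                          # /etc/systemd/system/openclaw.service
RUN_AS="openclaw"                                # assumed owner of the workspace
WORKSPACE="/home/openclaw/.openclaw/workspace"   # assumed workspace path

# Promotion: refresh the workspace, then start the standby gateway.
promote() {
  # A failed pull should not block promotion; a stale workspace beats no gateway.
  sudo -u "$RUN_AS" git -C "$WORKSPACE" pull --ff-only \
    || echo "workspace pull failed, starting with existing checkout" >&2
  systemctl start "$GATEWAY_UNIT"
}

# Failback: stop the standby gateway once the primary is healthy again.
demote() {
  systemctl stop "$GATEWAY_UNIT"
}
```

Using `--ff-only` keeps the standby from creating merge commits. If the local checkout has diverged, the pull fails and the gateway starts from whatever is already on disk.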
Cost
| Component | Cost |
|---|---|
| VPS (2GB RAM) | $6-12/mo |
| Tailscale | Free (personal) |
| Git repo | Free |
| Total | $6-12/mo |
Tips
- Test monthly. Kill your primary, verify failover works. Trust but verify.
- Keep the standby minimal. No crons, no extra channels. It's recovery mode.
- Git push frequently. The standby's workspace is only as fresh as your last push.
- Use Tailscale. It makes cross-VPS networking trivial. No firewall rules, no port forwarding.
- Different bot tokens. If using Discord on both, you need two bot applications. Same bot token = last-connect-wins.
- Monitor the monitor. Check journalctl -u openclaw-health-monitor occasionally to make sure it's running.
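Checking on the monitor can itself be scripted. The sketch below relies on `systemctl is-active`, which prints `active` for a running unit; `unit_is_active` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
MONITOR_UNIT="openclaw-health-monitor"

# Return 0 only if systemd reports the unit as active.
unit_is_active() {
  [ "$(systemctl is-active "$1" 2>/dev/null)" = "active" ]
}

# Example, to run by hand on the standby:
# if ! unit_is_active "$MONITOR_UNIT"; then
#   echo "health monitor is not running" >&2
# fi
```

This complements the journal check: `is-active` tells you the service is up, while the journal shows whether its checks are actually succeeding.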
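OpenClaw Failover Gateway
Deploy a standby OpenClaw gateway that takes over automatically when the primary gateway goes down. The design is active-passive, with automatic promotion and automatic demotion.
Features
- ~30 second failover: the health monitor detects that the primary is down and promotes the standby
- Automatic recovery: once the primary is back, the standby demotes itself
- Zero split-brain: primary and standby use different channels (no duplicate messages)
- Git-synced workspace: the standby pulls the latest workspace on promotion
- $12/month: runs on a minimum-spec VPS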
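Architecture
```
PRIMARY (your main VPS)            STANDBY (failover VPS)
├─ Full stack (all channels)       ├─ Single channel only (e.g. Discord DM)
├─ All cron jobs                   ├─ No cron jobs (recovery mode)
├─ Gateway running ✅              ├─ Gateway stopped 💤
└─ Pushes workspace to Git         └─ Health monitor watches primary
                                      │
                                      ├─ Primary healthy → sleep
                                      ├─ Primary down for 30s → PROMOTE
                                      └─ Primary back → DEMOTE
```
Key idea: split the channels between primary and standby. Do not share credentials; give each node exclusive ownership of different channels. This eliminates split-brain entirely.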
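Channel Split Examples
| Setup | Primary | Standby |
|---|---|---|
| Rocket.Chat + Discord | Rocket.Chat (full) | Discord DM only |
| Discord + Telegram | Discord (full) | Telegram DM only |
| Slack + Discord | Slack (full) | Discord DM only |
The primary handles everything. The standby is for minimal recovery only, just enough to stay reachable.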
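Prerequisites
- A primary OpenClaw instance running on a VPS
- A second VPS for the standby ($6-12/month, any provider)
- A Tailscale mesh network (or any VPN/private network)
- A Git repository for workspace sync (GitHub, GitLab, etc.)
- A second messaging channel for the standby (different from the primary's)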
Step-by-Step Deployment
Phase 1: Provision the Standby VPS
Any cheap VPS works. Recommended: 2GB RAM, Ubuntu 24.04.
```bash
# Harden the server
ufw allow 22/tcp
ufw enable
apt install -y fail2ban unattended-upgrades

# Create the openclaw user
adduser openclaw --disabled-password
usermod -aG sudo openclaw
# Copy your SSH key to the openclaw user

# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up --hostname=your-failover-name
```
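Phase 2: Install OpenClaw
```bash
# Run as the openclaw user
curl -fsSL https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash
source ~/.bashrc
nvm install --lts
npm install -g openclaw

# Clone the workspace (the repository URL is missing in the source; substitute your own)
git clone <your-repo-url> ~/.openclaw/workspace
```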
Phase 3: Failover Config
Create a minimal OpenClaw config on the standby. Enable only the standby channel:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["anthropic/claude-sonnet-4-5"]
      },
      "workspace": "/home/openclaw/.openclaw/workspace"
    },
    "list": [{ "id": "main", "default": true }]
  },
  "channels": {
    "discord": {
      "enabled": true,
      "token": "<your Discord bot token>",
      "dm": {
        "policy": "allowlist",
        "allowFrom": ["<your Discord user ID>"]
      }
    }
  },
  "gateway": {
    "port": 18789,
    "mode": "local",
    "bind": "tailnet"
  }
}
```
Important: disable this channel on the primary gateway to avoid conflicts.
Test that it works: openclaw gateway run. Verify that the bot connects and responds, then stop it.
Phase 4: Deploy Health Monitor
Copy the included scripts/health-monitor.sh to the standby:
```bash
sudo cp health-monitor.sh /usr/local/bin/openclaw-health-monitor.sh
sudo chmod +x /usr/local/bin/openclaw-health-monitor.sh
```
Edit the variables at the top:
- PRIMARY_IP: the primary's Tailscale IP
- PRIMARY_PORT: the primary's gateway port (default: 18789)
- SECRETS_HOST: (optional) host to rsync secrets from on promotion
Create the systemd services:
/etc/systemd/system/openclaw-health-monitor.service
```ini
[Unit]
Description=OpenClaw failover health monitor
After=network-online.target tailscaled.service
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/openclaw-health-monitor.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
/etc/systemd/system/openclaw.service
```ini
[Unit]
Description=OpenClaw gateway (failover)
After=network-online.target tailscaled.service
Wants=network-online.target

[Service]
Type=simple
User=openclaw
Group=openclaw
WorkingDirectory=/home/openclaw/.openclaw/workspace
# Adjust to the output of `which openclaw`; an nvm-based install places the binary under ~/.nvm, not /usr/bin
ExecStart=/usr/bin/openclaw gateway run
Restart=on-failure
RestartSec=5
Environment=HOME=/home/openclaw
Environment=NODE_ENV=production

[Install]
WantedBy=multi-user.target
```
Enable the monitor (but not the gateway; the monitor starts it on promotion):
```bash
sudo systemctl daemon-reload
sudo systemctl enable openclaw-health-monitor
sudo systemctl start openclaw-health-monitor
# Do NOT enable openclaw.service; the monitor controls it
```
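Phase 5: Disable Standby Channel on Primary
This step is critical. Remove or disable the standby's channel in the primary's config:
```json
{
  "channels": {
    "discord": { "enabled": false }
  }
}
```
Each node owns its channels exclusively. No sharing, no conflicts.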
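Phase 6: Test
```bash
# On the primary: simulate a failure
sudo systemctl stop openclaw-gateway  # or kill the process

# Watch the standby's logs
journalctl -u openclaw-health-monitor -f
# Expected: 3 failed checks → promote → gateway starts → standby channel comes online

# On the primary: recover
sudo systemctl start openclaw-gateway
# Expected: standby detects the primary → demotes → gateway stops
```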
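Failover Timeline
| Time | Event |
|---|---|
| 0s | Primary goes down |
| 10s | First health check fails |
| 20s | Second check fails |
| 30s | Third check fails → PROMOTE |
| 35s | Git pull, secrets sync |
| 40s | Gateway starting |
| 45s | Standby channel active |
| ~60s | You're reachable again |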
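Edge Cases
| Scenario | Result |
|---|---|
| Primary dies | Standby promotes in ~30-60s |
| Primary and standby both die | You're offline (add a third node?) |
| Network partition | Standby may promote while the primary is still running; since they use different channels, there are no conflicts |
| Standby reboots | Health monitor auto-restarts (systemd) and resumes watching |
| Primary flaps | Promote/demote cycles; the health monitor can handle it, but consider increasing FAIL_THRESHOLD |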
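Failback
Recovery is automatic. When the primary comes back:
1. The health monitor detects that the primary is healthy
2. It stops the standby gateway
3. The primary resumes all of its channels
4. The standby returns to watching
No manual intervention is needed.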
Cost
| Component | Cost |
|---|---|
| VPS (2GB RAM) | $6-12/month |
| Tailscale | Free (personal) |
| Git repo | Free |
| Total | $6-12/month |
Tips
- Test monthly. Shut down the primary and verify that failover works. Trust but verify.
- Keep the standby minimal. No cron jobs,