Gateway Watchdog
Detect abnormal error patterns in the OpenClaw Gateway before they cause damage. Works with all channels: Telegram, WhatsApp, Discord, Slack, Signal, iMessage, Feishu, and more.
Born from a real incident: a silent try-catch caused 76,744 failed retries in 8 hours — undetected until the API quota was exhausted.
What It Detects
| Category | Patterns |
|---|---|
| Rate limiting | HTTP 429, `rate.limit`, `too many requests` |
| Server errors | HTTP 5xx status codes |
| Auth/permission | HTTP 401/403, `unauthorized`, `forbidden`, `token expired` |
| Network errors | `ETIMEDOUT`, `ECONNREFUSED`, `ECONNRESET`, `ENOTFOUND`, `socket hang up` |
| Delivery failures | `sendMessage failed`, `deliver failed`, `fetch failed` |
| Custom | User-defined via the `WATCHDOG_EXTRA_PATTERNS` env var |
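
To make the categories above more concrete, here is a minimal sketch of how a single log line could be mapped to one of them with `grep -Ei`. The function name `classify_line`, the category labels, and the exact regexes are assumptions made for this example; the actual script may match differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one log line into a category from the table.
# The regexes are illustrative and deliberately naive (a bare "503" in a
# timestamp would also match the server-error branch).
classify_line() {
  local line="$1"
  if grep -Eiq '(^|[^0-9])429([^0-9]|$)|rate.limit|too many requests' <<<"$line"; then
    echo "rate_limit"
  elif grep -Eiq '(^|[^0-9])5[0-9]{2}([^0-9]|$)' <<<"$line"; then
    echo "server_error"
  elif grep -Eiq '(^|[^0-9])40[13]([^0-9]|$)|unauthorized|forbidden|token expired' <<<"$line"; then
    echo "auth"
  elif grep -Eiq 'ETIMEDOUT|ECONNREFUSED|ECONNRESET|ENOTFOUND|socket hang up' <<<"$line"; then
    echo "network"
  elif grep -Eiq 'sendMessage failed|deliver failed|fetch failed' <<<"$line"; then
    echo "delivery"
  else
    echo "other"
  fi
}

classify_line 'HTTP 429 Too Many Requests'          # prints: rate_limit
classify_line 'connect ECONNREFUSED 127.0.0.1:443'  # prints: network
```

The branches are checked in order, so a line that matches several categories is counted under the first one that matches.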
Smart Analysis
- Error rate (errors/min): more meaningful than a raw count
- Spike detection: alerts when errors jump 3x or more compared with the last check
- Error concentration: flags when 80% or more of the errors are one type (suggesting a single fault source)
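
The three checks above boil down to simple arithmetic. The sketch below shows one way they could be computed. The function names are hypothetical, and integer counts and an integer spike ratio are assumed.

```bash
#!/usr/bin/env bash
# Errors per minute over the monitoring window (one decimal place).
error_rate() {
  awk -v c="$1" -v w="$2" 'BEGIN { printf "%.1f\n", c / w }'
}

# Succeeds when the current count is at least `ratio` times the previous one.
# A previous count of 0 is treated as "no baseline" and never counts as a spike.
is_spike() {
  local current="$1" previous="$2" ratio="${3:-3}"
  (( previous > 0 && current >= previous * ratio ))
}

# Succeeds when the largest category holds at least 80% of all errors.
is_concentrated() {
  local top="$1" total="$2"
  (( total > 0 && top * 100 >= total * 80 ))
}

# Example numbers: 150 errors in a 30-minute window, up from 12 at the
# previous check, with 120 of the 150 falling into one category.
echo "rate: $(error_rate 150 30)/min"
is_spike 150 12 3 && echo "spike detected"
is_concentrated 120 150 && echo "concentrated in one category"
```

How the real script treats a zero baseline or a non-integer ratio is not documented here, so those choices are guesses.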
Quick Start
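```bash
bash scripts/gateway-watchdog.sh check    # Silent mode: prints only when errors exceed the threshold
bash scripts/gateway-watchdog.sh verbose  # Always prints the full report
bash scripts/gateway-watchdog.sh history  # Shows the monitoring history
bash scripts/gateway-watchdog.sh trend    # Error trend over the last 24 hours
```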
Heartbeat integration
Add to HEARTBEAT.md:
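```markdown
## Gateway error monitoring (every heartbeat)
- Run ~/.openclaw/workspace/skills/gateway-watchdog/scripts/gateway-watchdog.sh check
- If the output is non-empty, report it to the user immediately
- No output = healthy, skip the report
```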
Cron (optional)
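```bash
openclaw cron add \
  --name "gateway-watchdog" \
  --schedule "*/30 * * * *" \
  --task "Run gateway-watchdog.sh verbose. If errors are detected, report them to the user." \
  --channel last
```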
Configuration
All via environment variables:
| Variable | Default | Description |
|---|---|---|
| `WATCHDOG_THRESHOLD` | `30` | Error count that triggers an alert |
| `WATCHDOG_WINDOW` | `30` | Monitoring window in minutes |
| `WATCHDOG_SPIKE_RATIO` | `3` | Alert when errors jump Nx vs. the last check |
| `WATCHDOG_EXTRA_PATTERNS` | (empty) | Custom regex patterns (e.g., `99991400\|99991403`) |
| `WATCHDOG_STATE` | `~/.local/state/gateway-watchdog/state.json` | State file |
| `WATCHDOG_LOG` | `~/.local/state/gateway-watchdog/history.log` | History log |
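
For readers wiring these variables into their own wrapper, the sketch below shows the usual shell idiom for applying the documented defaults when a variable is unset or empty. The function name `load_watchdog_config` is hypothetical, and this is not necessarily how the script itself resolves its settings.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply the documented defaults without overriding
# values the caller has already exported.
load_watchdog_config() {
  : "${WATCHDOG_THRESHOLD:=30}"
  : "${WATCHDOG_WINDOW:=30}"
  : "${WATCHDOG_SPIKE_RATIO:=3}"
  : "${WATCHDOG_EXTRA_PATTERNS:=}"
  : "${WATCHDOG_STATE:=$HOME/.local/state/gateway-watchdog/state.json}"
  : "${WATCHDOG_LOG:=$HOME/.local/state/gateway-watchdog/history.log}"
}

load_watchdog_config
echo "threshold=$WATCHDOG_THRESHOLD window=${WATCHDOG_WINDOW}m spike=${WATCHDOG_SPIKE_RATIO}x"
```

A one-off override can be passed on the command line, for example `WATCHDOG_THRESHOLD=50 WATCHDOG_WINDOW=10 bash scripts/gateway-watchdog.sh check`.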
Adding channel-specific patterns
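```bash
# Feishu-specific error codes
export WATCHDOG_EXTRA_PATTERNS="99991400|99991403|99991404|99991429"

# Telegram-specific
export WATCHDOG_EXTRA_PATTERNS="Too Many Requests|FLOOD_WAIT|bot was blocked"

# Discord-specific
export WATCHDOG_EXTRA_PATTERNS="DiscordAPIError|Missing Permissions|Unknown Channel"
```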
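
Before exporting a custom pattern, it can help to confirm that it matches a sample log line. The sketch below does this with `grep -E`. It assumes the watchdog treats `WATCHDOG_EXTRA_PATTERNS` as an extended regular expression, which the `|` alternation in the examples suggests but does not confirm. The helper `pattern_matches` and the sample log line are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical check: does a candidate pattern match a sample log line?
pattern_matches() {
  local pattern="$1" sample="$2"
  grep -Eq -- "$pattern" <<<"$sample"
}

candidate='99991400|99991403'
if pattern_matches "$candidate" 'feishu api error code=99991403'; then
  # Quote the value: an unquoted "|" would be parsed as a shell pipe.
  export WATCHDOG_EXTRA_PATTERNS="$candidate"
  echo "pattern ok: $WATCHDOG_EXTRA_PATTERNS"
fi
```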
Interpreting Results
🔴 Alert (Chinese locale)
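```
🔴 Gateway 最近 30 分钟出现 150 条异常错误(阈值: 30,5/min)
📈 错误突增: 12 → 150(3倍阈值触发)
错误分类:
429/限流: 120
5xx服务端错误: 5
认证/权限: 0
网络错误: 5
消息投递失败: 20
⚠️ 单一错误类型「429/限流」占比 80%,可能是单一故障源
```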
🔴 Alert (English equivalent)
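```
🔴 Gateway detected 150 errors in the last 30 min (threshold: 30, 5/min)
📈 Error spike: 12 → 150 (3x threshold triggered)
Error breakdown:
429/Rate-limit: 120
5xx Server errors: 5
Auth/Permission: 0
Network errors: 5
Delivery failures: 20
⚠️ Single error type 429/Rate-limit accounts for 80%+ — likely a single fault source
```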
💚 Healthy
No output from `check` mode.
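
Because `check` signals health through silence, a caller only needs to test whether the output is empty. The sketch below shows that pattern. The `WATCHDOG_SCRIPT` variable and the `report_status` helper are hypothetical, and the sketch relies on the output alone because the script's exit codes are not documented here.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: any output from `check` is an alert, silence is healthy.
WATCHDOG_SCRIPT="${WATCHDOG_SCRIPT:-scripts/gateway-watchdog.sh}"

report_status() {
  local output="$1"
  if [ -n "$output" ]; then
    printf 'ALERT\n%s\n' "$output"
  else
    echo "healthy"
  fi
}

# Only run the check when the script is actually present at that path.
if [ -f "$WATCHDOG_SCRIPT" ]; then
  report_status "$(bash "$WATCHDOG_SCRIPT" check)"
fi
```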
Limitations
- Requires systemd and journalctl (falls back to `~/.openclaw/logs/` on macOS)
- Reactive, not preventive
- Cannot pinpoint which extension is failing; check the error details for clues
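
The journalctl-or-files fallback mentioned in the first limitation could be decided roughly as follows. This is a sketch only: the function name `pick_log_source` is hypothetical, and the code chooses a source without reading any logs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prefer journalctl, else fall back to the log directory.
pick_log_source() {
  if command -v journalctl >/dev/null 2>&1; then
    # A real reader might then run something like:
    #   journalctl --since "30 min ago" --no-pager
    # (the unit or filter to use is not specified in this README)
    echo "journalctl"
  elif [ -d "$HOME/.openclaw/logs" ]; then
    echo "files:$HOME/.openclaw/logs"
  else
    echo "none"
  fi
}

echo "log source: $(pick_log_source)"
```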
Security
- Read-only: only reads logs
- No credentials: no API keys are accessed
- No network: no outbound requests
- User state only: state lives in `~/.local/state/gateway-watchdog/` (XDG standard, no elevated permissions needed)