Error Recovery Automation Skill

This skill provides patterns for automating the detection and recovery of common OpenClaw errors: gateway unresponsiveness, browser service failures, cron scheduler issues, and other recurring problems. It builds on health‑monitoring and system‑diagnostics by adding automated recovery workflows that can be triggered by cron jobs, heartbeat checks, or external monitoring.

When to use

- A service (gateway, browser, cron) fails intermittently and you want to automate its restart.
- You are setting up proactive monitoring and need a recovery plan beyond just detection.
- You want to reduce the manual steps required when “Läuft alles?” (“Is everything running?”) reveals a failure.
- You need to ensure critical OpenClaw components stay running with minimal user intervention.
- You are asked to “create a skill for error recovery automation” (this is that skill).

Core patterns

1. Error Detection Patterns

Before automating recovery, you must reliably detect the error. Use these detection methods:

Gateway unresponsive:

- openclaw gateway status returns a non-zero exit code or shows "running": false.
- Gateway logs (~/.openclaw/logs/gateway.err.log) contain recent CRITICAL or ERROR entries.
- The HTTP health endpoint (if configured) returns a non-2xx status.
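
The first two gateway signals can be combined in a small helper. The sketch below is a minimal illustration, not OpenClaw's own health check. The function names recent_error_count and gateway_looks_healthy are hypothetical, "recent" is approximated by the last N lines of the log, and the log lines are assumed to contain the literal words ERROR or CRITICAL.

```bash
#!/usr/bin/env bash

# Count ERROR/CRITICAL lines among the last N lines of a log file.
# Assumption: the log level appears as the literal word ERROR or CRITICAL.
recent_error_count() {
  local logfile="$1" tail_lines="${2:-200}"
  # A missing or unreadable log counts as zero matches, not as a failure.
  [ -r "$logfile" ] || { echo 0; return 0; }
  # grep -c prints 0 and exits 1 when nothing matches; "|| true" hides that exit code.
  tail -n "$tail_lines" "$logfile" | grep -cE 'ERROR|CRITICAL' || true
}

# Returns 0 only if the status command succeeds and the log tail is quiet.
gateway_looks_healthy() {
  local logfile="${1:-$HOME/.openclaw/logs/gateway.err.log}"
  openclaw gateway status > /dev/null 2>&1 || return 1
  [ "$(recent_error_count "$logfile")" -eq 0 ]
}
```

Counting lines rather than checking timestamps keeps the helper independent of the log's date format. The trade-off is that a quiet log can still show old errors within the last N lines.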

Browser service unavailable:

- openclaw browser --browser-profile openclaw status --json shows "running": false or CDP not ready.
- Browser logs contain connection timeouts or Chrome process failures.
- A simple page load via curl to the CDP endpoint fails.

Cron scheduler not running:

- openclaw cron status returns "running": false or an error.
- Cron logs show no recent activity.
- Scheduled jobs are not triggered (check openclaw cron list for missed runs).

Memory search disabled:

- The memory_search tool returns “disabled” or a native-module error.
- openclaw doctor --fix reports a better-sqlite3 mismatch.

Permission errors:

- File operations fail with EACCES/EPERM.
- Logs indicate permission denied on specific paths (archive, logs, config).
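
Permission problems can often be caught before an operation fails. The following is a minimal sketch under the assumption that the relevant directories are known in advance. The function name find_unwritable_paths is hypothetical, and the paths to pass in depend on your installation.

```bash
#!/usr/bin/env bash

# Print every path that is missing or not writable by the current user.
# Returns 1 if at least one such path was found, 0 otherwise.
find_unwritable_paths() {
  local status=0 path
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path"
      status=1
    elif [ ! -w "$path" ]; then
      echo "not writable: $path"
      status=1
    fi
  done
  return "$status"
}
```

A recovery script could call it as find_unwritable_paths "$HOME/.openclaw/logs" and skip the restart when the result is non-zero, since a restart does not fix permissions.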

2. Automated Recovery Steps

For each error type, define a recovery script that attempts to restore service automatically. The script should:

1. Detect the error (using the patterns above).
2. Attempt recovery (restart service, fix permissions, rebuild module).
3. Verify recovery (re-run detection after a short wait).
4. Report outcome (exit code 0 for success, non-zero for persistent failure).
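
The four steps can be expressed once as a reusable loop. This is a sketch rather than part of OpenClaw. The name recover_with_retries and its parameters are hypothetical, and the check and restart functions are supplied by the caller.

```bash
#!/usr/bin/env bash

# recover_with_retries CHECK_FN RESTART_FN [MAX_ATTEMPTS] [WAIT_SECONDS]
# Returns 0 if CHECK_FN succeeds at any point, 1 if the service stays unhealthy.
recover_with_retries() {
  local check_fn="$1" restart_fn="$2"
  local max_attempts="${3:-2}" wait_seconds="${4:-5}"
  local attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Steps 1 and 3: detect, or verify the previous restart.
    if "$check_fn"; then
      return 0
    fi
    attempt=$((attempt + 1))
    echo "unhealthy, recovery attempt $attempt/$max_attempts" >&2
    # Step 2: attempt recovery; a failing restart must not abort the loop.
    "$restart_fn" || true
    sleep "$wait_seconds"
  done
  # Step 4: verify the final restart; its result becomes the return code.
  "$check_fn"
}
```

The final check after the loop matters: without it, a restart that succeeds on the last attempt would still be reported as a failure.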

Gateway Recovery Script Template

CODEBLOCK0

Browser Service Recovery Script Template

CODEBLOCK1

Cron Scheduler Recovery Script Template

CODEBLOCK2

Memory Search Recovery Script Template

CODEBLOCK3

3. Integration with Cron for Automated Recovery

Once you have a recovery script, schedule it as a cron job at an interval that matches how often the service tends to fail (e.g., every 30 minutes for the browser, every hour for the gateway). Use an isolated agent session to execute the script and announce failures.

Example cron job for browser recovery:

CODEBLOCK4

Agent response inside isolated session: The agent reads the script (or inline logic) and executes it via exec. If the script exits with 0, the agent announces success; if non‑zero, the cron delivery forwards the failure message.

Alternative: You can embed the recovery logic directly in the agent’s response (without a separate script) for simplicity, but a script is easier to test and reuse.

4. Escalation When Automation Fails

If automated recovery fails after the maximum attempts, escalate:

- Log the failure in memory/YYYY-MM-DD.md with the tag error-recovery-failed.
- Add a task to inbox/agent-aufgaben.md for manual diagnosis.
- Send a high-priority notification (if supported) to the user.
- Fall back to a safe state (e.g., disable the problematic component if possible).

Example escalation snippet:

CODEBLOCK5

5. Testing Recovery Scripts

Before deploying a recovery script as a cron job, test it manually:

1. Simulate the failure (e.g., kill the gateway process, stop the browser service).
2. Run the recovery script and verify it detects the failure and restarts the service.
3. Check that the service is functional after recovery.
4. Check the logs for any unintended side effects.

Example test command:

CODEBLOCK6

Examples

Example 1: Gateway Recovery Automation

Script: scripts/gateway-recovery.sh (see template above). Cron schedule: every 1 hour. Announce only on failure.

Example 2: Browser Recovery Automation

Script: scripts/browser-recovery.sh (see template above). Cron schedule: every 30 minutes. Announce only on failure.

Example 3: Combined Health‑Check + Recovery

A single script that checks multiple services and recovers any that are unhealthy. Useful for a comprehensive “keep‑alive” cron job.

CODEBLOCK7

Schedule this script every 30 minutes with an isolated agentTurn job.
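
One way to structure such a combined script is to walk the services in dependency order and stop at the first one that cannot be recovered. The sketch below is an illustration under assumptions. The function recover_in_order and the check_<service> / restart_<service> naming convention are hypothetical, and the caller defines those functions using the detection and restart commands described earlier.

```bash
#!/usr/bin/env bash

# recover_in_order SERVICE...
# List services with dependencies first (e.g. gateway before browser).
# For each SERVICE, the functions check_SERVICE and restart_SERVICE must exist.
recover_in_order() {
  local svc
  for svc in "$@"; do
    if "check_$svc"; then
      echo "$svc: healthy"
      continue
    fi
    echo "$svc: unhealthy, restarting"
    "restart_$svc" || true
    # Verify before moving on to services that depend on this one.
    if "check_$svc"; then
      echo "$svc: recovered"
    else
      echo "$svc: still unhealthy, stopping before dependents"
      return 1
    fi
  done
  return 0
}
```

Stopping early avoids restarting the browser while the gateway is still down, which would only add a second failure to the report.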

Anti‑Patterns

- Over-aggressive recovery: Restarting a service too frequently can cause instability. Set reasonable intervals (≥30 minutes) and maximum attempts (≤2).
- Silent recovery: If recovery succeeds but you never hear about it, you might not know the service was failing. At minimum, log recovery events to memory/ files.
- No verification: Restarting a service without verifying it actually recovered can mask deeper issues. Always re-check after restart.
- Hard-coded assumptions: Avoid assuming a specific Node version, path, or user ID. Use environment variables or detect them at runtime.
- Ignoring dependencies: Browser depends on gateway; restarting browser while gateway is down will fail. Check dependencies in order.
- Automating unsafe actions: Do not automate deletion of logs, modification of critical configs, or any irreversible action without a rollback plan.
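
To guard against over-aggressive recovery even when a job is triggered more often than planned, a script can record the time of its last restart and refuse to restart again within a minimum interval. This is a minimal sketch. The function names and the state file location are hypothetical, and the optional NOW argument exists only to make the logic testable.

```bash
#!/usr/bin/env bash

# cooldown_elapsed STATE_FILE MIN_INTERVAL_SECONDS [NOW]
# Returns 0 if no restart is recorded or the last one is at least
# MIN_INTERVAL_SECONDS old; returns 1 while the cooldown is still active.
cooldown_elapsed() {
  local state_file="$1" min_interval="$2" now="${3:-$(date +%s)}"
  local last
  [ -r "$state_file" ] || return 0
  last=$(cat "$state_file")
  # Treat an empty or non-numeric state file as "no restart recorded".
  case "$last" in
    ''|*[!0-9]*) return 0 ;;
  esac
  [ $((now - last)) -ge "$min_interval" ]
}

# record_restart STATE_FILE [NOW]
# Stores the restart time as seconds since the epoch.
record_restart() {
  local state_file="$1" now="${2:-$(date +%s)}"
  echo "$now" > "$state_file"
}
```

With a 30-minute minimum, a recovery script would call cooldown_elapsed "$state_file" 1800 before restarting and record_restart "$state_file" afterwards.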

Related Patterns

- Health-Monitoring skill – proactive health checks and monitoring.
- System-Diagnostics skill – diagnosing root causes of failures.
- Cron-Job Creation playbook – creating scheduled jobs.
- Gateway Health Check and Recovery playbook – specific to gateway.
- Browser Service Health Monitoring and Recovery playbook – specific to browser.
- Maintenance Execution playbook – incorporating recovery into regular maintenance.

References

- scripts/gateway-recovery.sh (template)
- INLINECODE22 (template)
- INLINECODE23 (template)
- INLINECODE24
- INLINECODE25
- INLINECODE26
- INLINECODE27
- INLINECODE28
- INLINECODE29
- INLINECODE30

Skill Integration

When an OpenClaw error occurs (gateway, browser, cron, memory search), read this skill to create or run an automated recovery script. Store successful recovery patterns in memory/patterns/tools.md. Update pending.md if automation fails and manual intervention is needed.

This skill increases autonomy by providing standardized, automated recovery workflows for common failures, reducing the need for manual intervention and increasing system resilience.

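Error Recovery Automation Skill

This skill provides patterns for automating the detection and recovery of common OpenClaw errors: gateway unresponsiveness, browser service failures, cron scheduler issues, and other recurring problems. It builds on health monitoring and system diagnostics by adding automated recovery workflows that can be triggered by cron jobs, heartbeat checks, or external monitoring.

When to use

- A service (gateway, browser, cron) fails intermittently and you want to restart it automatically.
- You are setting up proactive monitoring and need a recovery plan that goes beyond detection.
- You want to reduce the manual steps required when “Is everything running?” reveals a failure.
- You need to ensure critical OpenClaw components keep running with minimal user intervention.
- You are asked to “create a skill for error recovery automation” (this is that skill).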

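Core patterns

1. Error Detection Patterns

Before automating recovery, you must reliably detect the error. Use these detection methods:

Gateway unresponsive:

- openclaw gateway status returns a non-zero exit code or shows "running": false.
- Gateway logs (~/.openclaw/logs/gateway.err.log) contain recent CRITICAL or ERROR entries.
- The HTTP health endpoint (if configured) returns a non-2xx status.

Browser service unavailable:

- openclaw browser --browser-profile openclaw status --json shows "running": false or CDP not ready.
- Browser logs contain connection timeouts or Chrome process failures.
- A simple page load via curl to the CDP endpoint fails.

Cron scheduler not running:

- openclaw cron status returns "running": false or an error.
- Cron logs show no recent activity.
- Scheduled jobs are not triggered (check openclaw cron list for missed runs).

Memory search disabled:

- The memory_search tool returns “disabled” or a native-module error.
- openclaw doctor --fix reports a better-sqlite3 mismatch.

Permission errors:

- File operations fail with EACCES/EPERM.
- Logs show permission denied on specific paths (archive, logs, config).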

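2. Automated Recovery Steps

For each error type, define a recovery script that attempts to restore service automatically. The script should:

1. Detect the error (using the patterns above).
2. Attempt recovery (restart service, fix permissions, rebuild module).
3. Verify recovery (re-run detection after a short wait).
4. Report outcome (exit code 0 for success, non-zero for persistent failure).

Gateway Recovery Script Template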

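```bash
#!/bin/bash
# Gateway recovery: detect, restart if unhealthy, verify, report via exit code.
set -e

SERVICE="gateway"
MAX_ATTEMPTS=2
SLEEP_SECONDS=5

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

# Healthy when the status command exits with 0.
check() {
  openclaw gateway status > /dev/null 2>&1
}

restart() {
  # Tolerate a failing restart so that set -e does not abort before verification.
  openclaw gateway restart || log "$SERVICE restart command failed"
  sleep "$SLEEP_SECONDS"
}

attempt=0
while [ "$attempt" -lt "$MAX_ATTEMPTS" ]; do
  if check; then
    log "$SERVICE is healthy"
    exit 0
  fi
  # Plain assignment: ((attempt++)) returns non-zero at 0 and would trip set -e.
  attempt=$((attempt + 1))
  log "$SERVICE is unhealthy, restarting (attempt $attempt/$MAX_ATTEMPTS)..."
  restart
done

# Verify the final restart before reporting failure.
if check; then
  log "$SERVICE recovered after $attempt attempt(s)"
  exit 0
fi

log "$SERVICE could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```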

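Browser Service Recovery Script Template

```bash
#!/bin/bash
# Browser recovery: stop and start the profile if the status check fails.
set -e

SERVICE="browser"
PROFILE="openclaw"
MAX_ATTEMPTS=2
SLEEP_SECONDS=8

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

# Matches "running": true with or without a space after the colon.
check() {
  openclaw browser --browser-profile "$PROFILE" status --json 2>&1 \
    | grep -Eq '"running": *true'
}

restart() {
  # Stopping an already stopped browser may fail; that must not abort the script.
  openclaw browser --browser-profile "$PROFILE" stop || true
  sleep 2
  openclaw browser --browser-profile "$PROFILE" start || log "$SERVICE start command failed"
  sleep "$SLEEP_SECONDS"
}

attempt=0
while [ "$attempt" -lt "$MAX_ATTEMPTS" ]; do
  if check; then
    log "$SERVICE ($PROFILE) is healthy"
    exit 0
  fi
  # Plain assignment: ((attempt++)) returns non-zero at 0 and would trip set -e.
  attempt=$((attempt + 1))
  log "$SERVICE ($PROFILE) is unhealthy, restarting (attempt $attempt/$MAX_ATTEMPTS)..."
  restart
done

# Verify the final restart before reporting failure.
if check; then
  log "$SERVICE ($PROFILE) recovered after $attempt attempt(s)"
  exit 0
fi

log "$SERVICE ($PROFILE) could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```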

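Cron Scheduler Recovery Script Template

```bash
#!/bin/bash
# Cron scheduler recovery: the scheduler is restarted by restarting the gateway.
set -e

SERVICE="cron"
MAX_ATTEMPTS=1
SLEEP_SECONDS=3

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

# Accepts running: true with or without JSON quotes and spacing.
check() {
  openclaw cron status 2>&1 | grep -Eq '"?running"?: *true'
}

restart() {
  # Cron restarts automatically when the gateway restarts.
  # If cron is not running, restart the gateway.
  openclaw gateway restart || log "gateway restart command failed"
  sleep "$SLEEP_SECONDS"
}

attempt=0
while [ "$attempt" -lt "$MAX_ATTEMPTS" ]; do
  if check; then
    log "$SERVICE scheduler is running"
    exit 0
  fi
  # Plain assignment: ((attempt++)) returns non-zero at 0 and would trip set -e.
  attempt=$((attempt + 1))
  log "$SERVICE scheduler is not running, restarting gateway (attempt $attempt/$MAX_ATTEMPTS)..."
  restart
done

# Verify the final restart; with MAX_ATTEMPTS=1 this is the only verification.
if check; then
  log "$SERVICE scheduler recovered after $attempt attempt(s)"
  exit 0
fi

log "$SERVICE scheduler still not running after $MAX_ATTEMPTS attempts"
exit 1
```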

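Memory Search Recovery Script Template

```bash
#!/bin/bash
# Memory search recovery: rebuild the native module and restart the gateway.
set -e

SERVICE="memory_search"
MAX_ATTEMPTS=1

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

# Healthy when the output mentions neither "disabled" nor the native-module error.
check() {
  ! openclaw memory search --query test 2>&1 \
    | grep -qE 'disabled|Module did not self-register'
}

restart() {
  # Try to rebuild better-sqlite3.
  # The path assumes a global npm install layout relative to the openclaw binary.
  cd "$(dirname "$(command -v openclaw)")/../lib/node_modules/openclaw"
  npm rebuild better-sqlite3
  # Restart the gateway to load the rebuilt module.
  openclaw gateway restart
  sleep 5
}

attempt=0
while [ "$attempt" -lt "$MAX_ATTEMPTS" ]; do
  if check; then
    log "$SERVICE is functional"
    exit 0
  fi
  # Plain assignment: ((attempt++)) returns non-zero at 0 and would trip set -e.
  attempt=$((attempt + 1))
  log "$SERVICE is disabled, rebuilding native module (attempt $attempt/$MAX_ATTEMPTS)..."
  restart
done

# Verify the rebuild before reporting failure.
if check; then
  log "$SERVICE recovered after $attempt attempt(s)"
  exit 0
fi

log "$SERVICE could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```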

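3. Integration with Cron for Automated Recovery

Once you have a recovery script, schedule it as a cron job at an interval that matches how often the service tends to fail (e.g., every 30 minutes for the browser, every hour for the gateway). Use an isolated agent session to execute the script and announce failures.

Example cron job for browser recovery: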

```bash
# The JSON arguments are single-quoted so the shell passes them through unchanged.
openclaw cron add \
  --name "Browser-Recovery-Automation" \
  --schedule "every 30 minutes" \
  --session isolated \
  --payload '{"kind":"agentTurn","message":"Run browser recovery automation script","model":"default","thinking":"low"}' \
  --delivery '{"mode":"announce","channel":"telegram"}'
```

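Agent response inside isolated session: The agent reads the script (or inline logic) and executes it via exec. If the script exits with 0, the agent announces success; if non-zero, the cron delivery forwards the failure message.

Alternative: You can embed the recovery logic directly in the agent’s response (without a separate script) for simplicity, but a script is easier to test and reuse.

4. Escalation When Automation Fails

If automated recovery fails after the maximum attempts, escalate:

- Log the failure in memory/YYYY-MM-DD.md with the tag error-recovery-failed.
- Add a task to inbox/agent-aufgaben.md for manual diagnosis.
- Send a high-priority notification (if supported) to the user.
- Fall back to a safe state (e.g., disable the problematic component if possible).

Example escalation snippet: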

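```bash
# Run directly after the recovery script so that $? holds its exit code.
if [ $? -ne 0 ]; then
  echo "Browser recovery failed. Adding manual diagnosis task."
  # Append to agent-aufgaben.md
  echo "| 99 | Diagnose browser recovery failure – automated recovery failed after 2 attempts | ⬜ |" >> inbox/agent-aufgaben.md
  # Store in memory
  echo "## [error] Browser recovery automation failed" >> "memory/$(date +%Y-%m-%d).md"
  # The source snippet breaks off after "Date"; writing a timestamp line is an assumption.
  echo "Date: $(date '+%Y-%m-%d %H:%M:%S')" >> "memory/$(date +%Y-%m-%d).md"
fi
```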
