Boot Resume
Zero-cooperation session recovery after a gateway restart. No checkpoints, no hooks, no agent involvement: the script reads the evidence already on disk and resumes each session where it left off.
Problem
When the gateway restarts, any in-flight agent turn dies mid-execution. Session history is preserved on disk, but the agent doesn't know it needs to continue. Users must manually tell each interrupted session to resume.
Checkpoint-based approaches require the agent to save state before dying. Unexpected kills (SIGKILL, OOM, power loss) bypass this entirely.
Solution
A deterministic shell script runs on every gateway start via systemd ExecStartPost. No LLM in the detection loop.
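```
┌──────────────┐     ┌──────────────┐     ┌────────────────┐
│ Scan         │ ──▶ │ Detect       │ ──▶ │ Resume         │
│ sessions     │     │ JSONL        │     │ cron add       │
│ .json        │     │ tail         │     │ --system-event │
└──────────────┘     └──────────────┘     └────────────────┘
```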
1. Scan: finds sessions updated within the last 20 minutes
2. Detect: reads the last 5 JSONL lines to classify session state
3. Resume: schedules a one-shot `openclaw cron add --system-event --wake now` to inject a continuation prompt
Key insight: the JSONL session files already contain all the evidence needed to detect an interruption — no pre-save required.
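The Scan and Detect inputs can be sketched in Python as follows. This is a minimal illustration, not the script's actual code: it assumes sessions are stored as `*.jsonl` files and uses file modification time as the "updated" signal, whereas the real script appears to consult a sessions index. The names `recent_session_files` and `tail_lines` are hypothetical.

```python
import time
from collections import deque
from pathlib import Path

WINDOW_MINUTES = 20  # default scan window from the Configuration table
TAIL_LINES = 5       # number of trailing JSONL lines inspected


def recent_session_files(root, window_minutes=WINDOW_MINUTES, now=None):
    """Return *.jsonl files under root modified within the scan window.

    Assumption: file mtime stands in for the session's 'updated' time.
    """
    now = time.time() if now is None else now
    cutoff = now - window_minutes * 60
    return sorted(
        path
        for path in Path(root).rglob("*.jsonl")
        if path.is_file() and path.stat().st_mtime >= cutoff
    )


def tail_lines(path, count=TAIL_LINES):
    """Return the last `count` non-empty lines of a file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        non_empty = (line.rstrip("\n") for line in handle if line.strip())
        return list(deque(non_empty, maxlen=count))
```

Reading only the tail keeps the check cheap even for long sessions, since classification depends on how the file ends rather than on its full history.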
Detection Rules
| Last JSONL Entry | Status | Meaning |
|---|---|---|
| `toolResult` | `INTERRUPTED` | Tool returned, agent never processed it |
| `assistant` (empty text) | INTERRUPTED | Tool call dispatched, killed before response |
| user (non-trivial) | INTERRUPTED | Message received, never processed |
| assistant (with text) | COMPLETE | Session ended normally; skip |
| user (trivial: "ok", emoji) | TRIVIAL | No meaningful request pending; skip |
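The rules in this table amount to a small decision function over the last JSONL entry. The sketch below is illustrative only: it assumes a simplified entry shape with top-level `role` and `text` fields, which may not match the real session schema, and the `TRIVIAL_WORDS` list and the `UNKNOWN` fallback are hypothetical.

```python
import json

# Hypothetical list of acknowledgements treated as trivial.
TRIVIAL_WORDS = {"ok", "okay", "thanks", "thx", "yes", "no"}


def is_trivial(text):
    """True for short acknowledgements or text with no letters or digits."""
    stripped = text.strip().lower().rstrip(".!")
    if stripped in TRIVIAL_WORDS:
        return True
    # Emoji-only, punctuation-only, or empty messages carry no request.
    return not any(ch.isalnum() for ch in stripped)


def classify(last_line):
    """Map the final JSONL entry of a session to a status string.

    Assumed entry shape: {"role": "...", "text": "..."}.
    """
    entry = json.loads(last_line)
    role = entry.get("role")
    text = (entry.get("text") or "").strip()
    if role == "toolResult":
        return "INTERRUPTED"  # tool returned, agent never processed it
    if role == "assistant":
        return "COMPLETE" if text else "INTERRUPTED"
    if role == "user":
        return "TRIVIAL" if is_trivial(text) else "INTERRUPTED"
    return "UNKNOWN"  # not in the rules table; a caller would skip it
```

Only sessions classified as `INTERRUPTED` would be handed to the Resume step; every other status is skipped.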
Install
One command
```bash
bash {baseDir}/install.sh
```
Deploys three components:
- `boot-resume-check.sh` → `~/.openclaw/workspace/scripts/`
- `boot-resume.conf` → systemd drop-in (triggers script on every gateway start)
- `boot-resume-wake.service` → systemd user service (triggers script on system wake from sleep/suspend)
Manual
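```bash
cp {baseDir}/scripts/boot-resume-check.sh ~/.openclaw/workspace/scripts/
chmod +x ~/.openclaw/workspace/scripts/boot-resume-check.sh
mkdir -p ~/.config/systemd/user/openclaw-gateway.service.d
cp {baseDir}/templates/boot-resume.conf ~/.config/systemd/user/openclaw-gateway.service.d/
cp {baseDir}/templates/boot-resume-wake.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable boot-resume-wake.service
```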
Verify
```bash
systemctl --user restart openclaw-gateway
sleep 20
cat /tmp/openclaw/boot-resume.log
```
Expected output:
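```
[boot-resume] now=... cut=... (20 min window)
[boot-resume] scanning agent: main
[boot-resume] candidates: 0 (agent=main)
[boot-resume] done
```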
Test
1. Send a message that triggers a multi-step task (web search, code analysis, etc.)
2. Wait for the agent to start processing (tool calls in flight)
3. Restart the gateway: `systemctl --user restart openclaw-gateway`
4. The agent resumes automatically within ~35 seconds
Slash Command
When invoked as /boot-resume, run the script with --no-wait to skip the startup delay:
```bash
bash {baseDir}/scripts/boot-resume-check.sh --no-wait
```
Report results to the user: which sessions were resumed, or that none were found.
Configuration
| Variable | Default | Description |
|---|---|---|
| `WINDOW_MINUTES` | `20` | How far back to scan for interrupted sessions |
| `DELAY` | 20s | Delay before injecting the resume event |

Edit these values at the top of scripts/boot-resume-check.sh.
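To make the effect of the two settings concrete, the sketch below turns them into a scan cutoff and a fire time. It is an illustration under assumptions, not the script's logic: `ResumeConfig`, `delay_seconds`, and `plan` are hypothetical names, and support for an `m` suffix in `DELAY` is assumed for the example (the source only shows `20s`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResumeConfig:
    window_minutes: int = 20  # WINDOW_MINUTES default
    delay: str = "20s"        # DELAY default


def delay_seconds(delay):
    """Convert '20s' or '2m' to seconds; bare digits are read as seconds."""
    units = {"s": 1, "m": 60}
    value = delay.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def plan(now, config=ResumeConfig()):
    """Return (scan_cutoff, fire_at) as epoch seconds.

    Sessions updated before scan_cutoff are ignored; the resume event
    is injected at fire_at.
    """
    scan_cutoff = now - config.window_minutes * 60
    fire_at = now + delay_seconds(config.delay)
    return scan_cutoff, fire_at
```

With the defaults, a run at epoch second 10000 considers sessions updated at or after 8800 and schedules the resume event for 10020.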
Features
- Dual trigger: covers both gateway restart (ExecStartPost) and system sleep/wake (systemd sleep.target)
- Multi-agent support: scans all agents under `~/.openclaw/agents/`, not just `main`
- Smart filtering: skips system, heartbeat, cron, and subagent sessions automatically
- Deduplication: respects `restart-resume.json` to avoid double-resuming planned restarts
- Log rotation: auto-truncates log at 1000 lines
- Error visibility: Python and cron errors are logged, not swallowed
- Unique job names: timestamp-based to prevent conflicts on rapid restarts
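Three of the features above (log rotation, unique job names, smart filtering) are simple enough to sketch. The code below is a hedged illustration, not the script itself: the job-name format, the colon-separated session-key layout, and the `SKIP_MARKERS` values are assumptions made for the example.

```python
import time
from pathlib import Path

MAX_LOG_LINES = 1000
# Hypothetical key segments that mark sessions to ignore.
SKIP_MARKERS = ("system", "heartbeat", "cron", "subagent")


def truncate_log(path, max_lines=MAX_LOG_LINES):
    """Keep only the newest max_lines lines of the log file."""
    log = Path(path)
    if not log.exists():
        return
    lines = log.read_text(encoding="utf-8", errors="replace").splitlines(keepends=True)
    if len(lines) > max_lines:
        log.write_text("".join(lines[-max_lines:]), encoding="utf-8")


def unique_job_name(session_id, now=None):
    """Build a job name from the session id and the current epoch second."""
    stamp = int(time.time() if now is None else now)
    return f"boot-resume-{session_id}-{stamp}"


def should_skip(session_key):
    """True when any colon-separated segment of the key is a skip marker."""
    segments = session_key.lower().split(":")
    return any(marker in segments for marker in SKIP_MARKERS)
```

A timestamp at one-second resolution separates restarts that are at least a second apart; two runs within the same second for the same session would still produce the same name under this scheme.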
Comparison
| Approach | Pre-save required | Survives SIGKILL | LLM-free |
|---|---|---|---|
| Checkpoint / snapshot files | Yes | No | No |
| Pre-restart state dump | Yes | No | No |
| Session history replay | Yes | Partial | No |
| Post-hoc JSONL detection (this skill) | No | Yes | Yes |
Logs
Output: `/tmp/openclaw/boot-resume.log`
Each run logs: timestamp, scan window, candidate count, per-session status, and whether a resume job was armed.
Limitations
- 20-minute scan window (configurable): sessions idle longer than this are not resumed
- Resume prompt is generic: the agent relies on session context for continuity
- Telegram/Discord message queues already handle unprocessed incoming messages; this skill targets mid-execution interruptions
- Requires systemd (Linux); macOS users need manual launchd setup
Uninstall
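```bash
rm ~/.config/systemd/user/openclaw-gateway.service.d/boot-resume.conf
systemctl --user disable boot-resume-wake.service 2>/dev/null
rm ~/.config/systemd/user/boot-resume-wake.service
systemctl --user daemon-reload
rm ~/.openclaw/workspace/scripts/boot-resume-check.sh
rm -rf ~/.openclaw/workspace/skills/boot-resume
```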
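Boot Resume
Zero-cooperation session recovery after a gateway restart. No checkpoints, no hooks, no agent involvement: the script reads the evidence and continues from the point of interruption.
Problem
When the gateway restarts, any in-flight agent turn is terminated mid-execution. Session history is preserved on disk, but the agent does not know it needs to continue. Users must manually tell each interrupted session to resume.
Checkpoint-based approaches require the agent to save state before it dies. Unexpected kills (SIGKILL, OOM, power loss) bypass this mechanism entirely.
Solution
A deterministic shell script runs on every gateway start via systemd ExecStartPost. No LLM is involved in the detection loop.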
```
┌──────────────┐     ┌──────────────┐     ┌────────────────┐
│ Scan         │ ──▶ │ Detect       │ ──▶ │ Resume         │
│ sessions     │     │ JSONL        │     │ cron add       │
│ .json        │     │ tail         │     │ --system-event │
└──────────────┘     └──────────────┘     └────────────────┘
```
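1. Scan: finds sessions updated within the last 20 minutes
2. Detect: reads the last 5 JSONL lines to classify session state
3. Resume: schedules a one-shot `openclaw cron add --system-event --wake now` to inject a continuation prompt

Key insight: the JSONL session files already contain all the evidence needed to detect an interruption, so no pre-save is required.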
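Detection Rules
| Last JSONL Entry | Status | Meaning |
|---|---|---|
| toolResult | INTERRUPTED | Tool returned, agent never processed it |
| assistant (empty text) | INTERRUPTED | Tool call dispatched, killed before response |
| user (non-trivial) | INTERRUPTED | Message received, never processed |
| assistant (with text) | COMPLETE | Session ended normally; skip |
| user (trivial: "ok", emoji) | TRIVIAL | No meaningful request pending; skip |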
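Install
One command
```bash
bash {baseDir}/install.sh
```
Deploys three components:
- boot-resume-check.sh → ~/.openclaw/workspace/scripts/
- boot-resume.conf → systemd drop-in (triggers script on every gateway start)
- boot-resume-wake.service → systemd user service (triggers script on system wake from sleep/suspend)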
Manual
```bash
cp {baseDir}/scripts/boot-resume-check.sh ~/.openclaw/workspace/scripts/
chmod +x ~/.openclaw/workspace/scripts/boot-resume-check.sh
mkdir -p ~/.config/systemd/user/openclaw-gateway.service.d
cp {baseDir}/templates/boot-resume.conf ~/.config/systemd/user/openclaw-gateway.service.d/
cp {baseDir}/templates/boot-resume-wake.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable boot-resume-wake.service
```
Verify
```bash
systemctl --user restart openclaw-gateway
sleep 20
cat /tmp/openclaw/boot-resume.log
```
Expected output:
```
[boot-resume] now=... cut=... (20 min window)
[boot-resume] scanning agent: main
[boot-resume] candidates: 0 (agent=main)
[boot-resume] done
```
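Test
1. Send a message that triggers a multi-step task (web search, code analysis, etc.)
2. Wait for the agent to start processing (tool calls in flight)
3. Restart the gateway: `systemctl --user restart openclaw-gateway`
4. The agent resumes automatically within ~35 seconds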
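Slash Command
When invoked as /boot-resume, run the script with --no-wait to skip the startup delay:
```bash
bash {baseDir}/scripts/boot-resume-check.sh --no-wait
```
Report results to the user: which sessions were resumed, or that none were found.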
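Configuration
| Variable | Default | Description |
|---|---|---|
| WINDOW_MINUTES | 20 | How far back to scan for interrupted sessions |
| DELAY | 20s | Delay before injecting the resume event |

Edit these values at the top of scripts/boot-resume-check.sh.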
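Features
- Dual trigger: covers both gateway restart (ExecStartPost) and system sleep/wake (systemd sleep.target)
- Multi-agent support: scans all agents under ~/.openclaw/agents/, not just main
- Smart filtering: skips system, heartbeat, cron, and subagent sessions automatically
- Deduplication: respects restart-resume.json to avoid double-resuming planned restarts
- Log rotation: auto-truncates log at 1000 lines
- Error visibility: Python and cron errors are logged, not swallowed
- Unique job names: timestamp-based to prevent conflicts on rapid restarts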
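Comparison
| Approach | Pre-save required | Survives SIGKILL | LLM-free |
|---|---|---|---|
| Checkpoint / snapshot files | Yes | No | No |
| Pre-restart state dump | Yes | No | No |
| Session history replay | Yes | Partial | No |
| Post-hoc JSONL detection (this skill) | No | Yes | Yes |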
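Logs
Output: /tmp/openclaw/boot-resume.log
Each run logs: timestamp, scan window, candidate count, per-session status, and whether a resume job was armed.
Limitations
- 20-minute scan window (configurable): sessions idle longer than this are not resumed
- Resume prompt is generic: the agent relies on session context for continuity
- Telegram/Discord message queues already handle unprocessed incoming messages; this skill targets mid-execution interruptions
- Requires systemd (Linux); macOS users need manual launchd setup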
Uninstall
```bash
rm ~/.config/systemd/user/openclaw-gateway.service.d/boot-resume.conf
systemctl --user disable boot-resume-wake.service 2>/dev/null
rm ~/.config/systemd/user/boot-resume-wake.service
systemctl --user daemon-reload
rm ~/.openclaw/workspace/scripts/boot-resume-check.sh
rm -rf ~/.openclaw/workspace/skills/boot-resume
```