Watchdog — Cron Health Monitor
Monitors all cron jobs for failures and attempts to fix them automatically. Posts to Slack only when issues were found or fixed, or when unfixable errors remain.
CRITICAL: Slack Routing
When sending messages to Slack, you MUST specify channel: "slack" in every message tool call:
    message(action: "send", channel: "slack", target: "C0AHYTV5WP7", message: "...")
Without channel: "slack", messages will fail silently.
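One way to make the routing rule hard to forget is to build every message call through a single helper. The sketch below is illustrative only: `build_slack_message` and `assert_routed_to_slack` are hypothetical names, and the dictionary simply mirrors the arguments of the message tool call described above rather than any real client library.

```python
from typing import Any

SLACK_CHANNEL = "slack"


def build_slack_message(target: str, text: str) -> dict[str, Any]:
    """Return arguments for a message tool call that always routes to Slack."""
    return {
        "action": "send",
        "channel": SLACK_CHANNEL,  # required: without it the send fails silently
        "target": target,
        "message": text,
    }


def assert_routed_to_slack(call: dict[str, Any]) -> None:
    """Raise before sending if the call lacks the Slack channel (hypothetical guard)."""
    if call.get("channel") != SLACK_CHANNEL:
        raise ValueError('message call is missing channel: "slack"')
```

Running the guard just before each send turns the silent failure into a visible error.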
Schedule
Every 6 hours, at 05:00, 11:00, 17:00, and 23:00 CT
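As a rough illustration of the schedule, the sketch below computes the next run from a given wall-clock time. It assumes the input is already expressed in CT and that runs start at minute 0; `next_run` is a hypothetical helper, not part of the cron tool. Under the same assumptions, a standard five-field cron expression for this schedule would be `0 5,11,17,23 * * *`.

```python
from datetime import datetime, timedelta

RUN_HOURS = (5, 11, 17, 23)  # scheduled hours, CT wall-clock


def next_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now` (assumed to be in CT)."""
    for hour in RUN_HOURS:
        candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if candidate > now:
            return candidate
    # Past the last run of the day: roll over to the first run tomorrow.
    tomorrow = now + timedelta(days=1)
    return tomorrow.replace(hour=RUN_HOURS[0], minute=0, second=0, microsecond=0)
```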
Steps
- Call cron(action: "list") to get all jobs and their current status
- For each job, check: does lastStatus show an error? Is consecutiveErrors > 0? What was lastError?
- For "model not allowed" errors: use cron(action: "update", jobId: "...", patch: { payload: { model: "anthropic/claude-sonnet-4-6" } }), then force-run the job and log the change
- For timeout errors: use cron(action: "update", jobId: "...", patch: { payload: { timeoutSeconds: <current + 60> } }). NEVER edit cron JSON files directly
- For other errors: analyze, attempt a fix if possible, or flag as unresolved
- Post to Slack C0AHYTV5WP7 (#morning-briefs) ONLY if issues were found/fixed or unfixable errors exist
- If everything is healthy: no Slack message (silent pass)
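The triage in the steps above can be sketched as a small decision function. This is a sketch under stated assumptions, not the cron tool's actual behavior: the job record shape (`id`, `payload.timeoutSeconds`), the `"error"` status value, and the substring matching on `lastError` are all hypothetical. It only builds the arguments for a cron update call; it never touches cron JSON files.

```python
from typing import Any, Optional

FALLBACK_MODEL = "anthropic/claude-sonnet-4-6"
TIMEOUT_STEP_SECONDS = 60


def is_failing(job: dict[str, Any]) -> bool:
    """A job needs attention if its last run errored or errors are accumulating."""
    # Assumption: a failed run is reported as lastStatus == "error".
    return job.get("lastStatus") == "error" or job.get("consecutiveErrors", 0) > 0


def plan_fix(job: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return cron update arguments for a known error, or None to flag for analysis."""
    error = (job.get("lastError") or "").lower()
    if "model not allowed" in error:
        patch = {"payload": {"model": FALLBACK_MODEL}}
    elif "timeout" in error or "timed out" in error:
        current = job.get("payload", {}).get("timeoutSeconds")
        if current is None:
            return None  # no known timeout to extend; leave for manual analysis
        patch = {"payload": {"timeoutSeconds": current + TIMEOUT_STEP_SECONDS}}
    else:
        return None  # other errors: analyze, attempt a fix, or flag as unresolved
    return {"action": "update", "jobId": job["id"], "patch": patch}
```

A `None` result for a failing job means no automatic fix applies, so the job falls into the "other errors" branch.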
CRITICAL: Never Edit cron/jobs.json Directly
Always use the cron tool with action: "update" to modify job settings. Direct file edits break the cron system.
Slack Alert Format
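    🐺 Watchdog Report — <timestamp>
    ✅ Fixed: <job name> — <what was fixed>
    ❌ Unfixable: <job name> — <error summary>
    ⚠️ Flagged: <job name> — <issue description>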
Only send if at least one issue exists.
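A minimal sketch of how the alert text might be assembled, including the silent pass when nothing needs reporting. `format_alert` and its argument shapes (lists of name/detail pairs) are hypothetical; the header and the Fixed, Unfixable, and Flagged labels follow the alert format for this watchdog.

```python
from typing import Optional

SEP = " \u2014 "  # dash separator between the label part and the detail part

Entry = tuple[str, str]  # (job name, detail)


def format_alert(
    timestamp: str,
    fixed: list[Entry],
    unfixable: list[Entry],
    flagged: list[Entry],
) -> Optional[str]:
    """Build the Slack alert text, or return None when there is nothing to report."""
    if not (fixed or unfixable or flagged):
        return None  # everything healthy: send no Slack message
    lines = [f"🐺 Watchdog Report{SEP}{timestamp}"]
    lines += [f"✅ Fixed: {name}{SEP}{detail}" for name, detail in fixed]
    lines += [f"❌ Unfixable: {name}{SEP}{detail}" for name, detail in unfixable]
    lines += [f"⚠️ Flagged: {name}{SEP}{detail}" for name, detail in flagged]
    return "\n".join(lines)
```

Callers would send the result only when it is not `None`.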