Server Watchdog
Monitor and auto-heal remote servers over SSH. Check services, databases, disk, and memory; restart what's down and alert on what's wrong.
Prerequisites
- SSH access to target server (password or key-based)
- expect available locally (for password-based SSH)
- Target server runs PM2, systemd, or Docker for service management
Quick Reference
Check PM2 services
bash
ssh user@host pm2 list
ssh user@host pm2 logs --lines 20 --nostream
Check MongoDB
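bash
# Windows
ssh user@host "net start | findstr MongoDB"
ssh user@host "powershell -Command \"(Test-NetConnection -ComputerName 127.0.0.1 -Port 27017).TcpTestSucceeded\""
# Linux
ssh user@host systemctl status mongod
ssh user@host "mongosh --eval 'db.runCommand({ping:1})' --quiet"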
Check disk & memory
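bash
# Linux
ssh user@host "df -h && free -h"
# Windows
ssh user@host "powershell -Command \"Get-PSDrive -PSProvider FileSystem | Select Root,Used,Free; \$os=Get-CimInstance Win32_OperatingSystem; Write-Output ('RAM: '+[math]::Round((\$os.TotalVisibleMemorySize-\$os.FreePhysicalMemory)/1MB,1)+'GB / '+[math]::Round(\$os.TotalVisibleMemorySize/1MB,1)+'GB')\""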
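For unattended checks, the disk check can be reduced to a threshold test. The sketch below parses POSIX-format `df -P` output and prints the mount points at or above a limit. The function name `disk_over_threshold` and the default of 90% are assumptions for illustration, not part of the original skill.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose usage is at or above the threshold (default 90).
# Assumes device names and mount points contain no spaces.
disk_over_threshold() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                       # skip the header row
      use = $5; sub(/%/, "", use)  # capacity column, e.g. "93%"
      if (use + 0 >= limit) print $6 " " use "%"
    }'
}
```

A possible invocation is `ssh user@host df -P | disk_over_threshold 85`; empty output means no filesystem crossed the limit.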
Workflow
1. Diagnose: SSH in, check service status, logs, disk, memory
2. Identify: parse logs for errors, crashes, OOM, or unclean shutdowns
3. Fix: restart crashed services (pm2 restart, net start, systemctl restart)
4. Verify: confirm the service is back up and responding
5. Alert: notify the user via messaging with a summary
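The diagnose, fix, and verify steps can be sketched as one small function. This is a minimal illustration rather than the skill's actual implementation: `heal_service` is a hypothetical name, the check and restart commands are passed in as strings so they can wrap `ssh`, and the log-parsing step is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original skill.
# Usage: heal_service NAME CHECK_CMD RESTART_CMD
heal_service() {
  local name="$1" check_cmd="$2" restart_cmd="$3"

  # Diagnose: a zero exit status from the check means nothing to do.
  if sh -c "$check_cmd" >/dev/null 2>&1; then
    echo "$name: healthy"
    return 0
  fi

  # Fix: attempt a single restart.
  sh -c "$restart_cmd" >/dev/null 2>&1

  # Verify: re-run the same check before reporting success.
  if sh -c "$check_cmd" >/dev/null 2>&1; then
    echo "$name: restarted"
    return 0
  fi

  # Alert: a non-zero return tells the caller to escalate.
  echo "$name: still down"
  return 1
}
```

For example, `heal_service mongod "ssh user@host systemctl is-active --quiet mongod" "ssh user@host sudo systemctl restart mongod"` prints one status line and returns non-zero only if the restart did not help.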
Crash Analysis
When a service is down, check these in order:
1. Service logs: pm2 logs, journalctl -u service, Windows Event Log
2. Application logs: check log files at configured paths
3. System events: OOM killer, unexpected shutdowns, disk full
4. Database logs: for MongoDB, check mongod.log for Fatal ("s":"F") entries
MongoDB crash patterns
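"s":"F" - Fatal error (crash)
Unhandled exception - Internal error (often FTDC-related)
Detected unclean shutdown - Process was killed without a clean shutdown
WiredTiger error - Storage engine corruption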
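Scanning a log for these patterns by hand gets tedious, so a counter per pattern can help with triage. The sketch below is an assumption-laden example: `classify_mongod_log` is a hypothetical name, and the matching is plain substring search on the four crash indicators this section is about (fatal severity `"s":"F"`, unhandled exceptions, unclean shutdowns, WiredTiger errors).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read mongod.log lines on stdin and print how
# many lines match each crash pattern.
classify_mongod_log() {
  awk '
    /"s":"F"/                   { fatal++ }
    /Unhandled exception/       { unhandled++ }
    /Detected unclean shutdown/ { unclean++ }
    /WiredTiger error/          { wiredtiger++ }
    END {
      printf "fatal=%d unhandled=%d unclean=%d wiredtiger=%d\n", fatal, unhandled, unclean, wiredtiger
    }'
}
```

It could be fed with something like `ssh user@host tail -n 500 /var/log/mongodb/mongod.log | classify_mongod_log`, where the log path is an assumption (a common Linux package default) and should be replaced with the configured path.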
Auto-Heal Recipes
PM2 service restart
bash
pm2 restart <app-name>
pm2 save  # persist the process list after the restart
MongoDB (Windows)
bash
net stop MongoDB
timeout /t 5
net start MongoDB
MongoDB (Linux)
bash
sudo systemctl restart mongod
Deploy watchdog service
For persistent monitoring, deploy the included watchdog script:
1. Copy scripts/mongodb-watchdog.js to the target server
2. Install: npm init -y && npm install mongodb
3. Start: pm2 start mongodb-watchdog.js --name mongodb-watchdog
4. Save: pm2 save
SSH with password (via expect)
When key-based auth isn't available:
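bash
expect -c 'set timeout 20
spawn ssh -o StrictHostKeyChecking=no user@host "COMMAND"
expect {
  "password:" { send "PASSWORD\r"; exp_continue }
  eof
}'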
Alert Template
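🚨 Server Alert: [hostname]
⏰ Time: [timestamp]
❌ Issue: [service] is down
📋 Cause: [crash reason from logs]
🔄 Action: Auto-restarted [service]
✅ Status: [service] is back online
📊 System health:
• RAM: X GB / Y GB
• Disk: Z% used
• Services: N/N online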
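The alert fields can also be filled in from a script. The following is a minimal sketch under stated assumptions: `render_alert` is a hypothetical name, the emoji prefixes and the system health section are omitted to keep the output plain ASCII, and the timestamp defaults to the local clock when not supplied.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the alert fields from positional arguments.
# Usage: render_alert HOST SERVICE CAUSE [TIMESTAMP]
render_alert() {
  local host="$1" service="$2" cause="$3"
  local when="${4:-$(date '+%Y-%m-%d %H:%M:%S')}"
  cat <<EOF
Server Alert: $host
Time: $when
Issue: $service is down
Cause: $cause
Action: auto-restarted $service
EOF
}
```

The rendered text can then be handed to whatever messaging channel is in use, for example `render_alert db01 mongod "unclean shutdown"`, where the host and service names are placeholders.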
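Server Watchdog
Monitor and auto-heal remote servers over SSH. Check services, databases, disk, and memory; restart what's down and report what's wrong.
Prerequisites
- SSH access to the target server (password or key-based authentication)
- expect installed locally (for password-based SSH)
- Target server runs PM2, systemd, or Docker for service management
Quick Reference
Check PM2 services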
bash
ssh user@host pm2 list
ssh user@host pm2 logs --lines 20 --nostream
Check MongoDB
bash
# Windows
ssh user@host "net start | findstr MongoDB"
ssh user@host "powershell -Command \"(Test-NetConnection -ComputerName 127.0.0.1 -Port 27017).TcpTestSucceeded\""
# Linux
ssh user@host systemctl status mongod
ssh user@host "mongosh --eval 'db.runCommand({ping:1})' --quiet"
Check disk & memory
bash
# Linux
ssh user@host "df -h && free -h"
# Windows
ssh user@host "powershell -Command \"Get-PSDrive -PSProvider FileSystem | Select Root,Used,Free; \$os=Get-CimInstance Win32_OperatingSystem; Write-Output ('RAM: '+[math]::Round((\$os.TotalVisibleMemorySize-\$os.FreePhysicalMemory)/1MB,1)+'GB / '+[math]::Round(\$os.TotalVisibleMemorySize/1MB,1)+'GB')\""
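Workflow
1. Diagnose: SSH in, check service status, logs, disk, memory
2. Identify: parse logs for errors, crashes, OOM, or unclean shutdowns
3. Fix: restart crashed services (pm2 restart, net start, systemctl restart)
4. Verify: confirm the service is back up and responding
5. Alert: notify the user via messaging with a summary
Crash Analysis
When a service is down, check these in order:
1. Service logs: pm2 logs, journalctl -u service, Windows Event Log
2. Application logs: check log files at configured paths
3. System events: OOM killer, unexpected shutdowns, disk full
4. Database logs: for MongoDB, check mongod.log for Fatal ("s":"F") entries
MongoDB crash patterns
"s":"F" - Fatal error (crash)
Unhandled exception - Internal error (often FTDC-related)
Detected unclean shutdown - Process was killed without a clean shutdown
WiredTiger error - Storage engine corruption
Auto-Heal Recipes
PM2 service restart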
bash
pm2 restart <app-name>
pm2 save  # persist the process list after the restart
MongoDB (Windows)
bash
net stop MongoDB
timeout /t 5
net start MongoDB
MongoDB (Linux)
bash
sudo systemctl restart mongod
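Deploy watchdog service
For persistent monitoring, deploy the included watchdog script:
1. Copy scripts/mongodb-watchdog.js to the target server
2. Install: npm init -y && npm install mongodb
3. Start: pm2 start mongodb-watchdog.js --name mongodb-watchdog
4. Save: pm2 save
SSH with password (via expect)
When key-based auth isn't available: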
bash
expect -c 'set timeout 20
spawn ssh -o StrictHostKeyChecking=no user@host "COMMAND"
expect {
  "password:" { send "PASSWORD\r"; exp_continue }
  eof
}'
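Alert Template
🚨 Server Alert: [hostname]
⏰ Time: [timestamp]
❌ Issue: [service] is down
📋 Cause: [crash reason from logs]
🔄 Action: Auto-restarted [service]
✅ Status: [service] is back online
📊 System health:
• RAM: X GB / Y GB
• Disk: Z% used
• Services: N/N online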