Server Mate
Version: 1.3.3
Use this skill to design or implement a two-plane monitoring system:
- a Python agent on the server that tails logs and samples host metrics
- an OpenClaw-side analyzer that aggregates data, explains failures, answers questions, and sends alerts
Start
- Confirm the environment first: Linux distribution, Nginx or Apache, PHP-FPM layout, log paths, webhook target, and whether automated actions may touch a live host.
- Keep collection read-only until the user explicitly asks for automation. Add alerting before any auto-ban or auto-heal behavior.
- In OpenClaw deployments, OPENAI_API_KEY is injected by the runtime when AI analysis is enabled. Do not ask the user to export it manually. Treat webhook URLs or tokens in config.yaml as secrets and do not commit them.
- Treat ./data/GeoIP.conf the same way. It may contain MaxMind AccountID and LicenseKey, so keep it local-only and out of Git.
- Prefer MaxMind's official GeoLite2 workflow through ./data/GeoIP.conf and geoipupdate. Treat the built-in public mirror fallback only as an operator-reviewed bootstrap path when no local .mmdb file is present.
- Treat auto-ban and auto-heal as privileged features. They may execute operator-supplied firewall or service restart commands and should stay disabled or dry_run: true until reviewed.
- Use the references progressively instead of loading everything at once:
  - Read references/architecture.md for overall design, component boundaries, and rollout order.
  - Read references/data-contracts.md before defining JSON payloads, storage schemas, metrics, or natural-language query handlers.
  - Read references/ops-playbook.md before implementing thresholds, webhooks, reports, auto-ban, or self-heal logic.
  - Read references/sqlite-schema.md before extending historical storage or report queries.
- Use scripts/server_agent.py as the collector, daemon entrypoint, and SQLite rollup writer.
Delivery workflow
1. Map the request to one or more tracks.
   - Agent collection
   - Aggregation and storage
   - Alerting and reporting
   - AI diagnosis
   - Guarded remediation
2. Implement the smallest safe slice first.
   - Start with structured access, error, and system events.
   - Add rollup metrics and natural-language answers next.
   - Add webhook alerts after the counters are stable.
   - Enable auto-ban or auto-heal only when thresholds, cooldowns, allowlists, and audit logs already exist.
3. Validate with real or synthetic logs before changing production services.
   - Explain caveats in plain language.
     - Example: UV is often an approximation based on IP and user-agent unless the site provides a stronger visitor key.
     - Example: upload bandwidth is unavailable unless the access log includes request length or a similar field.
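The UV caveat above can be made concrete with a small sketch. It assumes events are dictionaries with hypothetical `ip` and `user_agent` keys (the real field names come from references/data-contracts.md) and treats each distinct IP and user-agent pair as one visitor, which is exactly why the number is only an approximation.

```python
import hashlib


def visitor_key(ip: str, user_agent: str) -> str:
    """Hypothetical visitor key: a hash of IP plus user-agent."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def approximate_uv(events: list[dict]) -> int:
    """Count distinct (ip, user_agent) pairs as an estimate of unique visitors."""
    return len({visitor_key(e["ip"], e.get("user_agent", "")) for e in events})


# Synthetic events; the field names are assumptions for illustration.
events = [
    {"ip": "203.0.113.5", "user_agent": "Mozilla/5.0"},
    {"ip": "203.0.113.5", "user_agent": "Mozilla/5.0"},  # same visitor, second page view
    {"ip": "203.0.113.5", "user_agent": "curl/8.0"},     # same IP, different client
    {"ip": "198.51.100.7", "user_agent": "Mozilla/5.0"},
]

pv = len(events)             # every request counts as a page view
uv = approximate_uv(events)  # distinct IP + user-agent pairs
print(f"PV={pv} UV~{uv}")
```

Two people behind one NAT address with the same browser collapse into a single visitor here, and one person switching clients is counted twice. A site-provided visitor key would replace `visitor_key` if one exists.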
Agent rules
- Prefer Python, psutil, and the standard library for the first implementation.
- Prefer a generated ./config.yaml plus local SQLite state such as ./metrics.db before adding external services.
- Keep generated artifacts inside the current skill workspace by default: ./config.yaml, ./metrics.db, ./logs/, and ./reports/. Do not default to /opt, /var/log, or other system-wide directories.
- Prefer the system_metrics + sites[] matrix layout from config.example.yaml instead of new single-site keys.
- Support configurable log paths. Do not hardcode site layouts when the vhost config can be read instead.
- Emit structured JSON with timezone-aware timestamps, host or site identifiers, event type, and enough raw context to debug parser mistakes.
- In multi-site mode, collect host CPU or memory metrics once per cycle and keep site log parsing isolated per domain.
- Separate parsing, aggregation, transport, and action execution so that HTTP push, stdout replay, file drop, or websocket transport can be swapped independently.
- Keep unknown lines and parser failures as first-class counters instead of dropping them silently.
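As a rough illustration of the structured-event and parser-failure rules above, the sketch below parses one access log line in the common "combined" format into a JSON-ready dictionary. The event field names, the `counters` object, and the `parse_access_line` helper are assumptions for this example, not the schema in references/data-contracts.md.

```python
import json
import re
from collections import Counter
from datetime import datetime

# Nginx/Apache "combined" access log format.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Parser outcomes are counted, never dropped silently.
counters = Counter()


def parse_access_line(line: str, site: str) -> dict:
    """Return a structured event; unparsed lines become their own event type."""
    raw = line.rstrip("\n")
    m = COMBINED.match(raw)
    if m is None:
        counters["parse_failures"] += 1
        return {"event_type": "unparsed", "site": site, "raw": raw}
    counters["parsed"] += 1
    # %z keeps the UTC offset, so the timestamp stays timezone-aware.
    ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return {
        "event_type": "access",
        "site": site,
        "timestamp": ts.isoformat(),
        "ip": m["ip"],
        "method": m["method"],
        "path": m["path"],
        "status": int(m["status"]),
        "bytes_sent": 0 if m["bytes"] == "-" else int(m["bytes"]),
        "user_agent": m["ua"],
        "raw": raw,  # kept so parser mistakes can be debugged later
    }


good = parse_access_line(
    '203.0.113.5 - - [01/May/2024:12:00:00 +0000] '
    '"GET /index.php HTTP/1.1" 502 157 "-" "Mozilla/5.0"',
    site="example.com",
)
bad = parse_access_line("not an access log line", site="example.com")
print(json.dumps(good))
print(json.dumps(bad), dict(counters))
```

Keeping the raw line on both the parsed and the unparsed event costs some storage, but it lets a later pass re-parse failures once the pattern is fixed.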
Analyzer rules
- Store raw events separately from derived counters.
- Model traffic, performance, security, spider, and error signals as independent reducers over the same event stream.
- Translate natural-language requests into:
  - a time window
  - filters
  - an aggregation
  - a presentation format
- For AI error explanations, pass the fingerprint, surrounding context, and normalized fields instead of dumping entire logs.
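One possible shape for the four-part translation above is a small plan object that the analyzer fills in and then executes. The `QueryPlan` class, the `run_plan` helper, and the two aggregation names are hypothetical; how a request such as "How many 502s in the last hour?" is turned into a plan is left out of this sketch.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class QueryPlan:
    """Hypothetical structured form of a natural-language request."""
    window: timedelta                            # time window
    filters: dict = field(default_factory=dict)  # exact-match field filters
    aggregation: str = "count"                   # "count" or "top_paths"
    presentation: str = "text"                   # rendering hint for the reply


def run_plan(plan: QueryPlan, events: list, now: datetime):
    """Apply window and filters, then aggregate the remaining events."""
    start = now - plan.window
    selected = [
        e for e in events
        if e["timestamp"] >= start
        and all(e.get(k) == v for k, v in plan.filters.items())
    ]
    if plan.aggregation == "count":
        return len(selected)
    if plan.aggregation == "top_paths":
        return Counter(e["path"] for e in selected).most_common(3)
    raise ValueError(f"unknown aggregation: {plan.aggregation}")


now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"timestamp": now - timedelta(minutes=5), "status": 502, "path": "/api"},
    {"timestamp": now - timedelta(minutes=20), "status": 502, "path": "/api"},
    {"timestamp": now - timedelta(minutes=30), "status": 200, "path": "/"},
    {"timestamp": now - timedelta(hours=3), "status": 502, "path": "/old"},
]

# "How many 502s in the last hour?" expressed as a plan.
plan = QueryPlan(window=timedelta(hours=1), filters={"status": 502})
result = run_plan(plan, events, now)
print(result)
```

Separating the plan from its execution means the same plan can run against raw events or against pre-computed rollups without changing the language-handling side.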
Safety rules
- Treat auto-ban and auto-heal as opt-in features.
- Default Guarded Automation to dry_run: true and keep it there until the user has observed automation notifications and audit history for several days.
- Never flip dry_run to false, or enable auto_ban.enabled / auto_heal.enabled, unless the operator explicitly approves the command templates, allowlists, cooldowns, and audit destinations.
- Require cooldowns, max actions per window, and allowlists before running firewall or restart commands.
- Require whitelist checks before any ban command. Never ban loopback, RFC1918 private ranges, or trusted crawler families by default.
- Require TTL-based unban or an equivalent release plan for every ban. Do not create permanent firewall blocks from the first implementation.
- Record an audit event for every alert, dry-run, ban, unban, restart, and failed remediation attempt.
- Store audit history in SQLite tables such as automation_actions and banned_ips, and expose simple lookup queries in user-facing docs.
- Prefer one-shot remediation followed by escalation. Do not loop restarts.
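The ban-related rules above (protected ranges, dry-run by default, a TTL on every ban) could be combined into a single decision step along these lines. The `BanDecision` type and `decide_ban` function are hypothetical names. Trusted crawler ranges are not hardcoded here; the sketch assumes the operator supplies them through `allowlist`. No firewall command is executed.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Loopback plus the RFC1918 private ranges are never banned.
NEVER_BAN = [
    ipaddress.ip_network(n)
    for n in ("127.0.0.0/8", "::1/128", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


@dataclass
class BanDecision:
    ip: str
    action: str                            # "skip", "dry_run", or "ban"
    reason: str
    expires_at: Optional[datetime] = None  # every ban carries a release time


def decide_ban(ip, allowlist=(), dry_run=True, ttl=timedelta(hours=1), now=None):
    """Decide what to do with an offending IP without touching the firewall."""
    now = now or datetime.now(timezone.utc)
    addr = ipaddress.ip_address(ip)
    for net in [*NEVER_BAN, *allowlist]:
        if addr in net:
            return BanDecision(ip, "skip", f"{ip} is inside protected range {net}")
    action = "dry_run" if dry_run else "ban"
    return BanDecision(ip, action, "threshold exceeded", expires_at=now + ttl)


fixed_now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
private = decide_ban("192.168.1.10", now=fixed_now)
preview = decide_ban("203.0.113.9", now=fixed_now)
approved = decide_ban("203.0.113.9", dry_run=False, now=fixed_now)
print(private.action, preview.action, approved.action, approved.expires_at)
```

Each returned decision, including the skips and dry runs, is what would be written to the audit trail before any command template is rendered.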
Report expectations
- Daily report: prior-day PV, UV, IP, request totals, bandwidth, status mix, top errors, and slow endpoints.
- Weekly report: blocked IP trends, crawler trends, suspicious route clusters, and recurring slow routes.
- Monthly report: bandwidth peak, disk growth, capacity warning, and remediation summary.
Automation scheduling
Use external scheduling for production unless the user explicitly wants an always-on daemon-only design.
- Recommended ingestion pattern:
  - Run server_agent.py --once every 10 minutes from cron or a systemd timer.
  - This keeps log parsing incremental, writes SQLite rollups, and avoids duplicate resident processes.
- For systemd deployments in Clawhub-style packaging:
  - Do not rely on bundling a .service file inside the skill package.
  - Generate a host-local unit with server_agent.py --config ./config.yaml --generate-service, then paste it into /etc/systemd/system/server-mate.service.
- Recommended report pattern:
  - Run report_generator.py as one-shot scheduled jobs.
  - Daily PDF push at 01:00.
  - Weekly PDF push every Monday at 01:10.
  - Monthly PDF push on day 1 at 01:20.
- In multi-site mode, a single scheduled report_generator.py run should iterate over every configured site unless the user explicitly passes --site.
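Incremental parsing across scheduled --once runs depends on remembering how far the previous run read. The following is a minimal sketch of that idea, not the actual logic in server_agent.py: the `read_new_lines` helper and its JSON state file are assumptions, and the reader restarts from offset zero when the inode changes (rotation) or the file shrinks (truncation).

```python
import json
import os


def read_new_lines(log_path: str, state_path: str) -> list[str]:
    """Return complete lines appended since the previous call."""
    try:
        with open(state_path, encoding="utf-8") as fh:
            state = json.load(fh)
    except FileNotFoundError:
        state = {"inode": None, "offset": 0}

    st = os.stat(log_path)
    # A new inode means the log was rotated; a smaller size means it was truncated.
    if state["inode"] != st.st_ino or st.st_size < state["offset"]:
        state = {"inode": st.st_ino, "offset": 0}

    with open(log_path, "rb") as fh:
        fh.seek(state["offset"])
        data = fh.read()

    # Consume only up to the last newline so a half-written line waits for the next run.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()

    state["offset"] += end
    with open(state_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    return lines
```

Note that this sketch does not go back for unread lines at the tail of a rotated file; a fuller reader would drain the old file before switching to the new one.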
Release notes for 1.3.2
- Multi-site matrix config using sites[] plus global system_metrics
- Host-global metrics stored separately from site-local business rollups
- Logrotate-tolerant incremental readers with inode or truncate recovery
- Guarded Automation with dry_run, whitelist checks, TTL-based unban, cooldown-based auto-heal, and SQLite audit trail
- SSH brute-force detection from logs.auth_log with ssh_brute_force alerting and optional linked auto-ban
- SSL certificate expiry inspection in report generation and webhook summaries
- Telegram delivery support for alerts and report notices
- GeoIP official refresh support via local ./data/GeoIP.conf and geoipupdate, with an operator-reviewed public mirror bootstrap fallback
- config.example.yaml and docs updated for MaxMind GeoLite2 setup in the current workspace
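Copyable cron examples:
```cron
# Relative paths assume the job runs from the skill workspace; cron itself starts jobs in the user's home directory.
*/10 * * * * /usr/bin/env bash -lc 'python3 ./scripts/server_agent.py --config ./config.yaml --once' >> ./logs/server-mate-agent.log 2>&1
0 1 * * * /usr/bin/env bash -lc 'python3 ./scripts/report_generator.py --config ./config.yaml pdf --range daily --send' >> ./logs/server-mate-report
```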
Systemd note:
- If the host already standardizes on systemd, prefer Type=oneshot services plus timers for reports.
- Use Restart=always only for the long-running --daemon agent mode.
Example requests
- "Design the ingestion API for Server-Mate."
- "Add 404 burst detection and webhook alerts."
- "Explain today's top 5xx error in plain language."
- "Plan a safe auto-heal flow for repeated 502 responses."