Monitoring

Complexity Levels

Level	Tools	Setup Time	Best For
Minimal	UptimeRobot, Healthchecks.io	15 min	Side projects, MVPs
Standard

The Three Pillars

Pillar	What It Answers	Tools
Metrics	"How is the system performing?"	Prometheus, Grafana, Datadog
Logs

Quick Start by Use Case

"I just want to know if it's down"
→ UptimeRobot (free) or Uptime Kuma (self-hosted). See simple.md.

"I need to debug production errors"
→ Sentry with your framework SDK. 5-minute setup. See apm.md.

"I want real observability"
→ Prometheus + Grafana + Loki. See prometheus.md.

"I need to centralize logs"
→ Loki for simple setups, ELK when you need complex queries. See logs.md.
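
For the "is it down" case, hosted tools like UptimeRobot do essentially one thing: probe a URL from outside and record whether it answered. A minimal sketch of such a probe, using only the Python standard library, might look like the following. The function name `check_once` and the injectable `opener` parameter are illustrative assumptions, not part of any tool named above.

```python
import time
import urllib.error
import urllib.request


def check_once(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe a URL once and report whether it answered with a 2xx/3xx status."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        status = exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout: no answer at all.
        status = None
    elapsed = time.monotonic() - started
    up = status is not None and 200 <= status < 400
    return {"url": url, "up": up, "status": status, "seconds": elapsed}
```

Calling `check_once("https://example.com/health")` (a placeholder URL) from a machine outside your own network, on a schedule, gives a rough equivalent of an external uptime check. A hosted service adds multiple probe locations and notification handling on top.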

What to Monitor

Applications (RED Method)

- Rate: requests per second
- Errors: error rate by endpoint
- Duration: latency (p50, p95, p99)
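
To make the three RED signals concrete, here is a rough sketch that derives them from a batch of request records. It assumes each record is an `(endpoint, status_code, duration_ms)` tuple, that only 5xx responses count as errors, and that percentiles use the nearest-rank method. In practice a metrics library would compute these for you, and `red_summary` is a hypothetical name.

```python
import math
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize (endpoint, status_code, duration_ms) tuples seen in one window."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    durations = sorted(duration for _, _, duration in requests)
    for endpoint, status, _ in requests:
        totals[endpoint] += 1
        if status >= 500:  # assumption: only 5xx counts as an error
            errors[endpoint] += 1
    latency = {}
    if durations:
        latency = {f"p{p}": percentile(durations, p) for p in (50, 95, 99)}
    return {
        "rate_per_second": len(requests) / window_seconds,
        "error_rate_by_endpoint": {
            endpoint: errors[endpoint] / count for endpoint, count in totals.items()
        },
        "latency_ms": latency,
    }
```

Breaking the error rate down by endpoint matters because a single failing route can hide inside a healthy-looking global average.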

Infrastructure (USE Method)

- Utilization: CPU, memory, disk usage
- Saturation: queue depth, load average
- Errors: hardware/system errors
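
As an illustration of the difference between utilization and saturation, the sketch below reads one of each from the local machine with the standard library. The helper names are hypothetical, `os.getloadavg` is only available on Unix-like systems, and a real setup would normally rely on an agent such as a Prometheus exporter instead.

```python
import os
import shutil


def disk_utilization(path="/"):
    """Fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def load_saturation(load_1m=None, cpus=None):
    """1-minute load average per CPU; values above 1.0 suggest queued work."""
    if load_1m is None:
        load_1m = os.getloadavg()[0]  # Unix only
    if cpus is None:
        cpus = os.cpu_count() or 1
    return load_1m / cpus
```

Utilization tells you how much of a resource is in use. Saturation tells you whether work is waiting for it, which is usually the earlier warning sign.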

Alerting Principles

Do	Don't
Alert on symptoms (user impact)	Alert on causes (CPU high)
Include runbook link

Alert fatigue kills monitoring. If alerts are ignored, you have no monitoring.

For alert configuration, severities, and on-call setup, see alerting.md.
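
A symptom-based alert with a runbook link could be modelled roughly as follows. This is a sketch under stated assumptions: the 5% threshold, the `page` severity, the alert name, and the runbook URL are all placeholders, and real deployments would express this in their alerting tool's own rule format.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    severity: str
    summary: str
    runbook_url: str


def evaluate_error_rate(total, failed, threshold=0.05,
                        runbook_url="https://example.com/runbooks/high-error-rate"):
    """Return an Alert when the failed/total ratio exceeds the threshold, else None."""
    if total == 0:
        return None  # no traffic, so no user impact to report
    ratio = failed / total
    if ratio <= threshold:
        return None
    return Alert(
        name="HighErrorRate",
        severity="page",
        summary=f"{ratio:.1%} of requests failed (threshold {threshold:.0%})",
        runbook_url=runbook_url,
    )
```

The alert fires on what users experience (failed requests) rather than on a possible cause such as high CPU, and it always carries a link telling the responder what to do next.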

Cost Comparison

Solution	Monthly Cost (small)	Monthly Cost (medium)
UptimeRobot	Free	$7
Uptime Kuma	$5 (VPS)	$5 (VPS)
Sentry	Free / $26	$80
Grafana Cloud	Free tier	$50+
Datadog	$15/host	$23/host + features
Self-hosted stack	$10-20 (VPS)	$50-100 (VPS)

Common Mistakes

- Starting with Prometheus/Grafana when Uptime Kuma would suffice
- No alerting (dashboards nobody watches)
- Too many alerts (alert fatigue → ignored)
- Missing runbooks (alert fires, nobody knows what to do)
- Not monitoring from outside (only internal checks)
- Storing logs forever (cost explodes)
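
For the last mistake, one simple guard is a retention check that finds log files older than a cutoff. The sketch below only lists candidates and deletes nothing. The `*.log` pattern and the function name `expired_logs` are assumptions for illustration, and log systems such as Loki or ELK have their own retention settings that should be preferred where available.

```python
import time
from pathlib import Path


def expired_logs(directory, max_age_days, now=None):
    """List *.log files in `directory` whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(
        path for path in Path(directory).glob("*.log")
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

Reviewing this list before deleting or archiving anything keeps the retention policy explicit instead of accidental.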

