Complexity Levels
| Level | Tools | Setup Time | Best For |
|---|---|---|---|
| Minimal | UptimeRobot, Healthchecks.io | 15 min | Side projects, MVPs |
| Standard | Uptime Kuma, Sentry, basic Grafana | 1-2 hours | Small teams, startups |
| Professional | Prometheus, Grafana, Loki, Alertmanager | 1-2 days | Production systems |
| Enterprise | Datadog, New Relic, or full OSS stack | Ongoing | Large-scale operations |
The Three Pillars
| Pillar | What It Answers | Tools |
|---|---|---|
| Metrics | "How is the system performing?" | Prometheus, Grafana, Datadog |
| Logs | "What happened?" | Loki, ELK, CloudWatch |
| Traces | "Why is this request slow?" | Jaeger, Tempo, Sentry |
Quick Start by Use Case
"I just want to know if it's down"
→ UptimeRobot (free) or Uptime Kuma (self-hosted). See simple.md.
"I need to debug production errors"
→ Sentry with your framework SDK. 5-minute setup. See apm.md.
"I want real observability"
→ Prometheus + Grafana + Loki. See prometheus.md.
"I need to centralize logs"
→ Loki for simple, ELK for complex queries. See logs.md.
What to Monitor
Applications (RED Method)
- Rate: requests per second
- Errors: error rate by endpoint
- Duration: latency (p50, p95, p99)
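The three RED signals can be derived from raw request records. The sketch below is only an illustration: the `Request` record, the `red_summary` helper, and the choice to count only 5xx responses as errors are assumptions, not part of any monitoring tool's API. In practice a metrics system such as Prometheus computes these for you.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape, for illustration only.
    endpoint: str
    status: int          # HTTP status code
    duration_ms: float   # latency of the request in milliseconds


def percentile(sorted_values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one time window."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in requests:
        totals[r.endpoint] += 1
        if r.status >= 500:  # assumption: only 5xx responses count as errors
            errors[r.endpoint] += 1

    durations = sorted(r.duration_ms for r in requests)

    def pct(p):
        return percentile(durations, p) if durations else None

    return {
        "rate": len(requests) / window_seconds,                         # Rate
        "error_rate": {ep: errors[ep] / n for ep, n in totals.items()},  # Errors
        "p50_ms": pct(50),                                               # Duration
        "p95_ms": pct(95),
        "p99_ms": pct(99),
    }
```

Tracking p95 and p99 alongside p50 matters because a healthy median can hide a slow tail that a minority of users still experience.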
Infrastructure (USE Method)
- Utilization: CPU, memory, disk usage
- Saturation: queue depth, load average
- Errors: hardware/system errors
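As a rough illustration of the USE signals, the following sketch samples one utilization value and one saturation value using only the Python standard library. The `use_snapshot` helper is hypothetical, and `os.getloadavg()` is only available on Unix-like systems. Hardware and system errors are not covered here because they usually come from kernel logs or an exporter.

```python
import os
import shutil


def use_snapshot(path="/"):
    """Sample a utilization and a saturation signal for this host."""
    disk = shutil.disk_usage(path)           # total, used, free in bytes
    load1, _load5, _load15 = os.getloadavg() # Unix only
    cpus = os.cpu_count() or 1

    return {
        # Utilization: fraction of the disk at `path` that is in use.
        "disk_utilization": disk.used / disk.total,
        # Saturation: 1-minute load average per CPU; values that stay
        # above 1.0 suggest work is queueing for CPU time.
        "load_per_cpu": load1 / cpus,
    }


if __name__ == "__main__":
    print(use_snapshot())
```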
Alerting Principles
| Do | Don't |
|---|---|
| Alert on symptoms (user impact) | Alert on causes (CPU high) |
| Include runbook link | Require investigation to understand |
| Set appropriate severity | Make everything P1 |
| Require action | Alert on "interesting" metrics |
Alert fatigue kills monitoring. If alerts are ignored, you have no monitoring.
For alert configuration, severities, and on-call setup, see alerting.md.
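A minimal sketch of these principles, written as plain Python rather than any real alerting tool's configuration: the alert fires on a user-facing symptom (error rate), carries a runbook link, and uses more than one severity. The thresholds, the `Alert` type, the severity labels other than P1, and the runbook URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    severity: str     # e.g. "P1" pages someone, "P3" opens a ticket
    summary: str
    runbook_url: str  # responders should not have to guess what to do


# Hypothetical runbook location, for illustration only.
RUNBOOK = "https://example.com/runbooks/high-error-rate"


def evaluate_error_rate(error_rate: float,
                        ticket_at: float = 0.01,
                        page_at: float = 0.05) -> Optional[Alert]:
    """Return an alert only when someone needs to act on user impact."""
    if error_rate >= page_at:
        severity = "P1"
    elif error_rate >= ticket_at:
        severity = "P3"
    else:
        return None  # nothing actionable, so no alert

    return Alert(
        name="HighErrorRate",
        severity=severity,
        summary=f"{error_rate:.1%} of requests are failing",
        runbook_url=RUNBOOK,
    )
```

For example, `evaluate_error_rate(0.02)` produces a P3 alert, while `evaluate_error_rate(0.001)` returns `None` and stays silent.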
Cost Comparison
| Solution | Monthly Cost (small) | Monthly Cost (medium) |
|---|---|---|
| UptimeRobot | Free | $7 |
| Uptime Kuma | $5 (VPS) | $5 (VPS) |
| Sentry | Free / $26 | $80 |
| Grafana Cloud | Free tier | $50+ |
| Datadog | $15/host | $23/host + features |
| Self-hosted stack | $10-20 (VPS) | $50-100 (VPS) |
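Per-host pricing and flat VPS pricing scale differently, which a little arithmetic makes concrete. The sketch below uses the $15/host and $20 figures from the table. The host counts and helper names are hypothetical, and the comparison ignores add-on features and the operator time a self-hosted stack needs.

```python
def per_host_monthly_cost(hosts: int, per_host_usd: float = 15.0) -> float:
    """Monthly cost of a per-host priced service."""
    return hosts * per_host_usd


def hosts_at_break_even(flat_usd: float, per_host_usd: float = 15.0) -> int:
    """Smallest host count where per-host pricing meets or exceeds a flat cost."""
    hosts = 1
    while per_host_monthly_cost(hosts, per_host_usd) < flat_usd:
        hosts += 1
    return hosts


if __name__ == "__main__":
    # Five hosts at $15/host versus a single $20 VPS for a self-hosted stack.
    print(per_host_monthly_cost(5))   # 75.0
    print(hosts_at_break_even(20.0))  # 2
```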
Common Mistakes
- Starting with Prometheus/Grafana when Uptime Kuma would suffice
- No alerting (dashboards nobody watches)
- Too many alerts (alert fatigue → ignored)
- Missing runbooks (alert fires, nobody knows what to do)
- Not monitoring from outside (only internal checks)
- Storing logs forever (cost explodes)
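Monitoring from outside can start as a small probe that runs on a machine outside your own infrastructure. This is a minimal sketch using only the Python standard library. The `is_up` helper and the example URL are assumptions, and hosted tools such as UptimeRobot add scheduling, multiple probe locations, and notifications on top of this idea.

```python
import http.client
import urllib.request


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx), timeouts, and refused connections
        # are all OSError subclasses; treat every one of them as "down".
        return False


if __name__ == "__main__":
    # Hypothetical endpoint; replace with your own public health URL.
    print(is_up("https://example.com/health"))
```

Run it from cron or a scheduler on a host that does not share a network or provider with the service, so an outage of your own infrastructure does not also silence the check.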