Logging & Observability
Patterns for building observable systems across the three pillars: logs, metrics, and traces.
Three Pillars
| Pillar | Purpose | Question It Answers | Example |
|---|---|---|---|
| Logs | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| Metrics | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| Traces | Request flow | Where is the bottleneck? | Span: `api-gateway → auth → order-service → db` |
Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.
Structured Logging
Always emit logs as structured JSON — never free-text strings.
Required Fields
| Field | Purpose | Required |
|---|---|---|
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (DEBUG … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation | Yes |
| `span_id` | Current span within trace | Yes |
| `correlation_id` | Business-level correlation (order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |
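As a rough illustration of the required fields, the sketch below builds one JSON log line. The `LogEntry` type and `buildLogEntry` helper are hypothetical names used only for this example; a real service would normally let its structured logger produce this shape.

```typescript
// Hypothetical shape covering the required and optional fields listed above.
type Level = "TRACE" | "DEBUG" | "INFO" | "WARN" | "ERROR" | "FATAL";

interface LogEntry {
  timestamp: string; // ISO-8601 with milliseconds
  level: Level;
  service: string;
  message: string;
  trace_id: string;
  span_id: string;
  correlation_id?: string;
  error?: { name: string; message: string; stack?: string };
  context?: Record<string, unknown>;
}

// Serialises one entry as a single JSON line. The clock is injectable so the
// output is reproducible in tests.
function buildLogEntry(
  fields: Omit<LogEntry, "timestamp">,
  now: Date = new Date(),
): string {
  const entry: LogEntry = { timestamp: now.toISOString(), ...fields };
  return JSON.stringify(entry);
}

const exampleLine = buildLogEntry(
  {
    level: "ERROR",
    service: "order-service",
    message: "payment declined",
    trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
    span_id: "00f067aa0ba902b7",
    correlation_id: "ORD-1234",
    error: { name: "PaymentError", message: "CARD_DECLINED" },
  },
  new Date("2024-01-01T12:00:00.123Z"),
);
console.log(exampleLine);
```

Keeping the error as a nested object rather than folding it into `message` is what lets a log backend filter on `error.message` directly.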
Context Enrichment
Attach context at the middleware level so downstream logs inherit automatically:
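```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import crypto from "node:crypto";

// Request-scoped store that downstream log calls can read from.
const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

// Assumes an Express-style `app`; `req.user` is set by earlier auth middleware.
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers["x-trace-id"] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});
```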
Library Recommendations
| Library | Language | Strengths | Perf |
|---|---|---|---|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |
Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.
Log Levels
| Level | When to Use | Example |
|---|---|---|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |
Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
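The rules above amount to a severity threshold plus an alerting hook. Here is a minimal sketch, assuming a process-wide threshold that can be changed at runtime; `setThreshold`, `shouldLog`, and `shouldAlert` are illustrative names, not a library API.

```typescript
// Severity order from least to most severe, matching the table above.
const LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"] as const;
type LogLevel = (typeof LEVELS)[number];

// Production default is INFO and above.
let threshold: LogLevel = "INFO";

// Lets an admin endpoint or signal handler change verbosity without a redeploy.
function setThreshold(level: LogLevel): void {
  threshold = level;
}

// True when a record at `level` passes the current threshold.
function shouldLog(level: LogLevel): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(threshold);
}

// Every FATAL should trigger an alert; lower levels rely on metric-based alerts.
function shouldAlert(level: LogLevel): boolean {
  return level === "FATAL";
}

console.log(shouldLog("DEBUG"), shouldLog("WARN"));
```

Raising verbosity for a single incident is then one call to `setThreshold("DEBUG")`, followed by a reset to `"INFO"` afterwards.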
Distributed Tracing
OpenTelemetry Setup
Always prefer OpenTelemetry over vendor-specific SDKs:
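```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "order-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start the SDK before the application code loads so auto-instrumentation can patch modules.
sdk.start();
```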
Span Creation
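```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

async function processOrder(order: Order) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", order.id);
      span.setAttribute("order.total_cents", order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // `err` is `unknown` in strict TypeScript, so normalise it first.
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw err;
    } finally {
      span.end();
    }
  });
}
```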
Context Propagation
- Use W3C Trace Context (`traceparent` header), the default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload
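To make the async-worker bullet concrete, the sketch below formats and parses a version `00` `traceparent` value by hand and carries it inside a job payload. The `QueueJob` shape and helper names are assumptions for illustration; in an instrumented service, `propagation.inject` and `propagation.extract` from `@opentelemetry/api` would normally do this work.

```typescript
// W3C Trace Context, version 00: "00-<trace-id>-<parent-id>-<flags>", lowercase hex.
interface TraceContext {
  traceId: string; // 32 hex chars
  spanId: string; // 16 hex chars
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Returns null for anything that is not a valid version-00 header.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header);
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  // All-zero trace or span IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Hypothetical job payload: the producer embeds the header, the worker restores it.
interface QueueJob {
  type: string;
  orderId: string;
  traceparent: string;
}

const queuedJob: QueueJob = {
  type: "send-receipt",
  orderId: "ORD-1234",
  traceparent: formatTraceparent({
    traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
    spanId: "00f067aa0ba902b7",
    sampled: true,
  }),
};

// Simulate the queue round trip, then recover the context on the worker side.
const restored = parseTraceparent(JSON.parse(JSON.stringify(queuedJob)).traceparent);
console.log(restored);
```

The worker would start its span as a child of the restored context, so the queue hop appears in the same trace instead of starting a new one.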
Trace Sampling
| Strategy | Use When |
|---|---|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |
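Always keep 100% of error traces regardless of strategy. Head-based samplers decide before the outcome of a request is known, so guaranteeing this generally requires a tail-based stage that sees the finished trace.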
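A tail-based decision that combines a probabilistic rate with "keep every error" could look like the sketch below. The `TraceSummary` shape and function names are hypothetical, and hashing the trace ID through its last eight hex digits is only one simple way to get a stable per-trace fraction.

```typescript
// Summary of a finished trace as a tail-based sampler might see it (hypothetical shape).
interface TraceSummary {
  traceId: string; // 32 lowercase hex chars
  hasError: boolean;
}

// Maps a trace ID to a stable fraction in [0, 1) using its last 8 hex digits,
// so every service makes the same decision for the same trace.
function traceIdFraction(traceId: string): number {
  return parseInt(traceId.slice(-8), 16) / 2 ** 32;
}

// Keep every error trace; keep roughly `rate` of the remaining traces.
function keepTrace(t: TraceSummary, rate: number): boolean {
  if (t.hasError) return true;
  return traceIdFraction(t.traceId) < rate;
}

const lowId = "4bf92f3577b34da6a3ce929d00000000";
const highId = "4bf92f3577b34da6a3ce929dffffffff";
console.log(keepTrace({ traceId: lowId, hasError: false }, 0.1));
console.log(keepTrace({ traceId: highId, hasError: false }, 0.1));
console.log(keepTrace({ traceId: highId, hasError: true }, 0.1));
```

Deriving the decision from the trace ID rather than from `Math.random()` keeps the outcome consistent across services, which avoids traces with missing spans.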
Metrics Collection
RED Method (Request-Driven)
Monitor these three for every service endpoint:
| Metric | What It Measures | Prometheus Example |
|---|---|---|
| Rate | Requests/sec | `rate(http_requests_total[5m])` |
| Errors | Failed request ratio | `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` |
| Duration | Response time | `histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))` |
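The same three numbers can be computed from raw request samples, which helps when checking what the PromQL expressions are meant to return. This is a sketch over an in-memory list with hypothetical type and function names; it uses a nearest-rank percentile, whereas Prometheus estimates quantiles from histogram buckets.

```typescript
interface RequestSample {
  status: number;
  durationSeconds: number;
}

interface RedSnapshot {
  ratePerSec: number;
  errorRatio: number;
  p99Seconds: number;
}

// Nearest-rank percentile over a non-empty list; `p` is in (0, 1].
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1];
}

// Rate, Errors, Duration for all samples observed within `windowSeconds`.
function computeRed(samples: RequestSample[], windowSeconds: number): RedSnapshot {
  if (samples.length === 0) {
    return { ratePerSec: 0, errorRatio: 0, p99Seconds: 0 };
  }
  const errors = samples.filter((s) => s.status >= 500).length;
  return {
    ratePerSec: samples.length / windowSeconds,
    errorRatio: errors / samples.length,
    p99Seconds: percentile(samples.map((s) => s.durationSeconds), 0.99),
  };
}

// 95 fast successes and 5 slow server errors over a 50-second window.
const redSamples: RequestSample[] = [
  ...Array.from({ length: 95 }, () => ({ status: 200, durationSeconds: 0.1 })),
  ...Array.from({ length: 5 }, () => ({ status: 500, durationSeconds: 2.0 })),
];
const red = computeRed(redSamples, 50);
console.log(red);
```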
USE Method (Resource-Driven)
For infrastructure components (CPU, memory, disk, network):
| Metric | What It Measures | Example |
|---|---|---|
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |
Monitoring Stack
| Tool | Category | Best For |
|---|---|---|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |
Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.
Alert Design
Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |
Alert Fatigue Prevention
- Alert on symptoms, not causes: "error rate > 5%", not "pod restarted"
- Multi-window, multi-burn-rate: catch both sudden spikes and slow burns
- Require runbook links: every alert must link to diagnosis and remediation
- Review monthly: delete or tune alerts that never fire or always fire
- Group related alerts: use inhibition rules to suppress child alerts
- Set appropriate thresholds: if an alert fires daily and is ignored, raise the threshold or delete it
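The multi-window, multi-burn-rate idea can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The 99.9% target and the 14.4 threshold are assumed values for illustration, a combination commonly paired with a 1-hour long window and a 5-minute short window against a 30-day SLO; the names are hypothetical.

```typescript
// Burn rate: how fast the error budget is being spent.
// 1.0 means the budget lasts exactly the SLO period; 14.4 means it is being spent 14.4x too fast.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

interface BurnWindows {
  longWindowErrorRatio: number; // e.g. over the last hour (assumed window)
  shortWindowErrorRatio: number; // e.g. over the last 5 minutes (assumed window)
  threshold: number;
}

// Page only when both windows exceed the threshold: the long window shows the
// burn is significant, the short window shows it is still happening.
function shouldPage(w: BurnWindows, sloTarget: number): boolean {
  return (
    burnRate(w.longWindowErrorRatio, sloTarget) > w.threshold &&
    burnRate(w.shortWindowErrorRatio, sloTarget) > w.threshold
  );
}

const SLO_TARGET = 0.999; // assumed 99.9% availability target

// Ongoing incident: 2% errors in both windows, a burn rate of about 20.
const ongoing = shouldPage(
  { longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.02, threshold: 14.4 },
  SLO_TARGET,
);

// Already recovered: the long window is still elevated, the short window is clean.
const recovered = shouldPage(
  { longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.0005, threshold: 14.4 },
  SLO_TARGET,
);
console.log(ongoing, recovered);
```

Requiring the short window as well is what stops the alert from continuing to fire long after the problem has been fixed.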
Dashboard Patterns
Overview Dashboard ("War Room")
- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs
Service Dashboard (Per-Service)
- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen
Observability Checklist
Every service must have:
- [ ] Structured JSON logging with consistent schema
- [ ] Correlation / trace IDs propagated on all requests
- [ ] RED metrics exposed for every external endpoint
- [ ] Health check endpoints (`/healthz` and `/readyz`)
- [ ] Distributed tracing with OpenTelemetry
- [ ] Dashboards for RED metrics and resource utilisation
- [ ] Alerts for error rate, latency, and saturation with runbook links
- [ ] Log level configurable at runtime without redeployment
- [ ] PII scrubbing verified and tested
- [ ] Retention policies defined for logs, metrics, and traces
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
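One way to keep label cardinality bounded is to normalise request paths before using them as a `route` label. The sketch below is a simple heuristic with a hypothetical `normalizeRoute` helper: it replaces numeric and UUID-shaped segments with `:id`. Frameworks that expose the matched route template make this unnecessary.

```typescript
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Strips the query string and replaces ID-like segments so the set of label
// values stays small no matter how many distinct orders or users exist.
function normalizeRoute(path: string): string {
  return path
    .split("?")[0]
    .split("/")
    .map((seg) => (/^\d+$/.test(seg) || UUID_RE.test(seg) ? ":id" : seg))
    .join("/");
}

console.log(normalizeRoute("/api/orders/12345?expand=items"));
console.log(normalizeRoute("/api/users/550e8400-e29b-41d4-a716-446655440000/preferences"));
```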
NEVER Do
- NEVER log passwords, tokens, API keys, or secrets, even at DEBUG level
- NEVER use `console.log` / `print` in production; use a structured logger
- NEVER use user IDs, emails, or request IDs as metric labels; cardinality will explode
- NEVER create alerts without a runbook link; unactionable alerts erode trust
- NEVER rely on logs alone; you need metrics and traces for full observability
- NEVER log request/response bodies by default; opt-in only, with PII redaction
- NEVER ignore log volume; set budgets and alert when a service exceeds its daily quota
- NEVER skip context propagation in async flows; broken traces are worse than no traces
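As a guard for the first rule, a logger can redact sensitive keys before serialising. The sketch below is key-based only and assumes plain JSON-like data; the `SENSITIVE_KEY` pattern and `redact` helper are illustrative and would need extending for a real schema. It does not catch secrets embedded in free-text values.

```typescript
// Key names treated as sensitive (an assumed, deliberately small list).
const SENSITIVE_KEY = /pass(word)?|secret|token|api[_-]?key|authorization/i;

// Returns a deep copy with values under sensitive keys replaced by a marker.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, v] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(v);
    }
    return out;
  }
  return value;
}

const safeContext = redact({
  user: "u_82",
  password: "hunter2",
  headers: { Authorization: "Bearer abc" },
  items: [{ apiKey: "k-123", sku: "A1" }],
});
console.log(JSON.stringify(safeContext));
```

Applying the redaction inside the logger, rather than at each call site, means a forgotten call cannot leak a secret.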
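Logging & Observability
Patterns for building observable systems across the three pillars: logs, metrics, and traces.
Three Pillars
| Pillar | Purpose | Question It Answers | Example |
|---|---|---|---|
| Logs | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| Metrics | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| Traces | Request flow | Where is the bottleneck? | Span: `api-gateway → auth → order-service → db` |
Each pillar is strongest when correlated. Embed `trace_id` in every log line to jump from a log entry to the full distributed trace.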
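Structured Logging
Always emit logs as structured JSON, never as free-text strings.
Required Fields
| Field | Purpose | Required |
|---|---|---|
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (DEBUG … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation ID | Yes |
| `span_id` | Span ID within the current trace | Yes |
| `correlation_id` | Business-level correlation ID (order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |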
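Context Enrichment
Attach context at the middleware level so downstream logs inherit it automatically:
```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import crypto from "node:crypto";

// Request-scoped store that downstream log calls can read from.
const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

// Assumes an Express-style `app`; `req.user` is set by earlier auth middleware.
app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers["x-trace-id"] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});
```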
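Library Recommendations
| Library | Language | Strengths | Perf |
|---|---|---|---|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |
Choose a logger that outputs structured JSON natively. Avoid loggers that require post-processing.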
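Log Levels
| Level | When to Use | Example |
|---|---|---|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely used in production) | Entering validateAddress with payload |
Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.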
Distributed Tracing
OpenTelemetry Setup
Always prefer OpenTelemetry over vendor-specific SDKs:
```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "order-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start the SDK before the application code loads so auto-instrumentation can patch modules.
sdk.start();
```
Span Creation
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

async function processOrder(order: Order) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", order.id);
      span.setAttribute("order.total_cents", order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // `err` is `unknown` in strict TypeScript, so normalise it first.
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw err;
    } finally {
      span.end();
    }
  });
}
```
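Context Propagation
- Use W3C Trace Context (`traceparent` header), the default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload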
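Trace Sampling
| Strategy | Use When |
|---|---|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |
Always keep 100% of error traces regardless of strategy.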
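Metrics Collection
RED Method (Request-Driven)
Monitor these three for every service endpoint:
| Metric | What It Measures | Prometheus Example |
|---|---|---|
| Rate | Requests/sec | `rate(http_requests_total[5m])` |
| Errors | Failed request ratio | `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` |
| Duration | Response time | `histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))` |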
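USE Method (Resource-Driven)
For infrastructure components (CPU, memory, disk, network):
| Metric | What It Measures | Example |
|---|---|---|
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |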
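Monitoring Stack
| Tool | Category | Best For |
|---|---|---|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |
Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies the cost.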
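Alert Design
Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |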
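Alert Fatigue Prevention
- Alert on symptoms, not causes: "error rate > 5%", not "pod restarted"
- Multi-window, multi-burn-rate: catch both sudden spikes and slow burns
- Require runbook links: every alert must link to diagnosis and remediation
- Review monthly: delete or tune alerts that never fire or always fire
- Group related alerts: use inhibition rules to suppress child alerts
- Set appropriate thresholds: if an alert fires daily and is ignored, raise the threshold or delete it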
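Dashboard Patterns
Overview Dashboard ("War Room")
- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs
Service Dashboard (Per-Service)
- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table, sorted by count and last seen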
Observability Checklist
Every service must have: