Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

Three Pillars

Pillar	Purpose	Question It Answers	Example
Logs	What happened	Why did this request fail?	{"level":"error","msg":"payment declined","user_id":"u_82"}
Metrics	How much / how fast	Is latency increasing?	http_request_duration_seconds{route="/api/orders"} 0.342
Traces	Request flow	Where is the bottleneck?	Span: api-gateway → auth → order-service → db

Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.

Structured Logging

Always emit logs as structured JSON — never free-text strings.

Required Fields

Field	Purpose	Required
timestamp	ISO-8601 with milliseconds	Yes
level
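As a minimal sketch of what a structured log line could look like, the helper below builds one JSON object per event. Only `timestamp` and `level` come from the table above; `msg`, `trace_id`, `user_id`, and the `buildLogLine` helper itself are assumptions made for illustration, not a prescribed schema.

```typescript
// Assumed level names; the document only confirms FATAL, ERROR, INFO and DEBUG.
type Level = "debug" | "info" | "warn" | "error" | "fatal";

interface LogEntry {
  timestamp: string; // ISO-8601 with milliseconds
  level: Level;
  msg: string;
  [key: string]: unknown; // extra context such as trace_id
}

// Hypothetical helper: returns one JSON-encoded log line.
function buildLogLine(
  level: Level,
  msg: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  const entry: LogEntry = {
    ...fields,
    // toISOString() always includes milliseconds and a UTC "Z" suffix.
    timestamp: now.toISOString(),
    level,
    msg,
  };
  return JSON.stringify(entry);
}

console.log(
  buildLogLine("error", "payment declined", { trace_id: "abc123", user_id: "u_82" }),
);
```

Core fields are written after the spread so that caller-supplied context cannot overwrite `timestamp`, `level`, or `msg`. In practice a library such as Pino would do this work.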

Context Enrichment

Attach context at the middleware level so downstream logs inherit automatically:

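```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Request-scoped store; `app` is assumed to be an Express application.
const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

app.use((req, res, next) => {
  const ctx = {
    // Reuse the caller's trace ID when present, otherwise start a new one.
    trace_id: req.headers["x-trace-id"] || randomUUID(),
    request_id: randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  // Code reached from next() can read ctx via asyncLocalStorage.getStore().
  asyncLocalStorage.run(ctx, () => next());
});
```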

Library Recommendations

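Library	Language	Strengths	Perf
Pino	Node.js	Fastest Node logger, low overhead	Excellent
structlog	Python	Composable processors, context binding	Good
zerolog	Go	Zero-allocation JSON logging	Excellent
zap	Go	High performance, typed fields	Excellent
tracing	Rust	Spans + events, async-aware	Excellent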

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.

Log Levels

Level	When to Use	Example
FATAL	App cannot continue, process will exit	Database connection pool exhausted
ERROR

Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
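The "INFO and above" rule can be sketched as a small level gate whose threshold can be changed while the process runs. The `LevelGate` class and the WARN level are assumptions for illustration; a real logger library would normally expose its own level setting.

```typescript
// Ordered from least to most severe; WARN is assumed, the rest appear in this document.
const LEVELS = ["debug", "info", "warn", "error", "fatal"] as const;
type Level = (typeof LEVELS)[number];

// Hypothetical gate: decides whether a message at `level` should be emitted.
class LevelGate {
  private threshold: Level;

  constructor(initial: Level = "info") {
    this.threshold = initial; // production default: INFO and above
  }

  // Can be called at runtime, e.g. from an admin endpoint or a signal handler.
  setLevel(level: Level): void {
    this.threshold = level;
  }

  isEnabled(level: Level): boolean {
    return LEVELS.indexOf(level) >= LEVELS.indexOf(this.threshold);
  }
}

const gate = new LevelGate();
console.log(gate.isEnabled("debug")); // suppressed at the default threshold
gate.setLevel("debug"); // temporarily lower the threshold while debugging
console.log(gate.isEnabled("debug"));
```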

Distributed Tracing

OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

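```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "order-service",
  // Export spans over OTLP/HTTP to the collector.
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```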

Span Creation

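```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// `Order`, `validateInventory`, and `chargePayment` are assumed to be defined elsewhere.
const tracer = trace.getTracer("order-service");

async function processOrder(order: Order) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", order.id);
      span.setAttribute("order.total_cents", order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // A caught value is not guaranteed to be an Error, so normalise it first.
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw err;
    } finally {
      // Always end the span, whether the work succeeded or failed.
      span.end();
    }
  });
}
```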

Context Propagation

- Use W3C Trace Context (traceparent header), the default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise traceparent into the job payload
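For the async-worker case, the idea can be sketched by hand: format a `traceparent` value, store it in the job payload, and parse it again in the worker. The `JobPayload` shape and both helper functions are hypothetical; in a real service the OpenTelemetry propagation API would normally inject and extract this header for you.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// W3C Trace Context, version 00: version-traceid-parentid-flags.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header);
  if (m === null) return null;
  const [, traceId, parentId, flags] = m;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Hypothetical job payload that carries the trace context to a worker.
interface JobPayload {
  orderId: string;
  traceparent: string;
}

const job: JobPayload = {
  orderId: "order-1",
  traceparent: formatTraceparent({
    traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
    parentId: "00f067aa0ba902b7",
    sampled: true,
  }),
};

// Producer side: serialise. Worker side: parse and continue the same trace.
const serialized = JSON.stringify(job);
const restored = parseTraceparent((JSON.parse(serialized) as JobPayload).traceparent);
console.log(restored);
```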

Trace Sampling

Strategy	Use When
Always On	Low-traffic services, debugging
Probabilistic (N%)

Always sample 100% of error traces regardless of strategy.
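One way to combine a probabilistic strategy with the "always keep errors" rule is sketched below. It assumes the error status is already known when the decision is made, which in practice usually means deciding after the trace completes (tail-based sampling, for example in a collector). The `shouldKeepTrace` function is hypothetical.

```typescript
// Hypothetical sampling decision. `ratio` is the fraction of traces to keep (0..1).
function shouldKeepTrace(traceId: string, ratio: number, hasError: boolean): boolean {
  // Error traces are kept regardless of the configured ratio.
  if (hasError) return true;
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Map the first 32 bits of the trace ID to [0, 1) so that every service
  // reaches the same decision for the same trace.
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio;
}

console.log(shouldKeepTrace("4bf92f3577b34da6a3ce929d0e0e4736", 0.1, false));
console.log(shouldKeepTrace("4bf92f3577b34da6a3ce929d0e0e4736", 0.1, true));
```

Deriving the decision from the trace ID rather than from a random number keeps traces whole: a trace is either kept by every service that uses the same ratio or dropped by all of them.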

Metrics Collection

RED Method (Request-Driven)

Monitor these three for every service endpoint:

Metric	What It Measures	Prometheus Example
Rate	Requests/sec	rate(http_requests_total[5m])
Errors	Failed request ratio	sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Duration	Response time	histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
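To make the three RED numbers concrete, the sketch below computes them from an in-memory list of request records. This only illustrates the arithmetic. A real service would export counters and histograms and let Prometheus do the aggregation. The `RequestRecord` shape and the `summarizeRed` function are assumptions.

```typescript
interface RequestRecord {
  status: number; // HTTP status code
  durationSeconds: number;
}

interface RedSummary {
  ratePerSecond: number;
  errorRatio: number;
  p99Seconds: number;
}

// Hypothetical helper: summarises the requests observed in one time window.
function summarizeRed(records: RequestRecord[], windowSeconds: number): RedSummary {
  const total = records.length;
  if (total === 0) return { ratePerSecond: 0, errorRatio: 0, p99Seconds: 0 };

  // Errors: share of requests that ended in a 5xx status.
  const errors = records.filter((r) => r.status >= 500).length;

  // Duration: nearest-rank p99 over the sorted durations.
  const sorted = records.map((r) => r.durationSeconds).sort((a, b) => a - b);
  const rank = Math.ceil((99 * total) / 100);

  return {
    ratePerSecond: total / windowSeconds,
    errorRatio: errors / total,
    p99Seconds: sorted[rank - 1],
  };
}
```

Note that this computes an exact percentile from raw samples, whereas a Prometheus histogram estimates the quantile from bucket counts.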

USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

Metric	What It Measures	Example
Utilization	% resource busy	CPU usage at 78%
Saturation

Monitoring Stack

Tool	Category	Best For
Prometheus	Metrics	Pull-based metrics, alerting rules
Grafana

Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.

Alert Design

Severity Levels

Severity	Response Time	Example
P1	Immediate	Service fully down, data loss
P2

Alert Fatigue Prevention

- Alert on symptoms, not causes: "error rate > 5%", not "pod restarted"
- Multi-window, multi-burn-rate: catch both sudden spikes and slow burns
- Require runbook links: every alert must link to diagnosis and remediation
- Review monthly: delete or tune alerts that never fire or always fire
- Group related alerts: use inhibition rules to suppress child alerts
- Set appropriate thresholds: if an alert fires daily and is ignored, raise the threshold or delete it
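The multi-window, multi-burn-rate idea can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and the alert fires only when both a long and a short window exceed the threshold. The 99.9% SLO, the 14.4 threshold, and the function names are assumptions chosen for illustration, not values from this document.

```typescript
interface WindowErrorRatios {
  longWindow: number; // e.g. error ratio over the last hour
  shortWindow: number; // e.g. error ratio over the last five minutes
}

// How fast the error budget is being consumed; 1 means exactly on budget.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Hypothetical paging rule: both windows must burn faster than the threshold.
// The long window shows the problem is significant, the short window shows
// it is still happening.
function shouldPage(ratios: WindowErrorRatios, sloTarget: number, threshold: number): boolean {
  return (
    burnRate(ratios.longWindow, sloTarget) > threshold &&
    burnRate(ratios.shortWindow, sloTarget) > threshold
  );
}

// Assumed values: 99.9% SLO and a burn-rate threshold of 14.4.
console.log(shouldPage({ longWindow: 0.02, shortWindow: 0.02 }, 0.999, 14.4));
console.log(shouldPage({ longWindow: 0.02, shortWindow: 0.0005 }, 0.999, 14.4));
```

Requiring the short window as well means the alert stops firing soon after the problem is fixed, which helps with the fatigue issues listed above.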

Dashboard Patterns

Overview Dashboard ("War Room")

- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs

Service Dashboard (Per-Service)

- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen

Observability Checklist

Every service must have:

- [ ] Structured JSON logging with consistent schema
- [ ] Correlation / trace IDs propagated on all requests
- [ ] RED metrics exposed for every external endpoint
- [ ] Health check endpoints (/healthz and /readyz)
- [ ] Distributed tracing with OpenTelemetry
- [ ] Dashboards for RED metrics and resource utilisation
- [ ] Alerts for error rate, latency, and saturation with runbook links
- [ ] Log level configurable at runtime without redeployment
- [ ] PII scrubbing verified and tested
- [ ] Retention policies defined for logs, metrics, and traces

Anti-Patterns

Anti-Pattern	Problem	Fix
Logging PII	Privacy/compliance violation	Mask or exclude PII; use token references
Excessive logging

NEVER Do

1. NEVER log passwords, tokens, API keys, or secrets, even at DEBUG level
2. NEVER use console.log / print in production: use a structured logger
3. NEVER use user IDs, emails, or request IDs as metric labels: cardinality will explode
4. NEVER create alerts without a runbook link: unactionable alerts erode trust
5. NEVER rely on logs alone: you need metrics and traces for full observability
6. NEVER log request/response bodies by default: opt-in only, with PII redaction
7. NEVER ignore log volume: set budgets and alert when a service exceeds its daily quota
8. NEVER skip context propagation in async flows: broken traces are worse than no traces
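The rule about secrets can be backed by a redaction step that runs before a log object is serialised. The sketch below is one possible approach, assuming plain JSON-like objects. The key pattern and the `redact` function are illustrative assumptions and would need tuning for a real schema.

```typescript
// Assumed pattern for key names that usually hold secrets.
const SENSITIVE_KEY = /pass(word)?|token|secret|api[_-]?key|authorization/i;

// Hypothetical helper: returns a copy of `value` with sensitive fields masked.
// Assumes plain objects, arrays, and primitives (no Dates, Maps, or class instances).
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      // Mask by key name; otherwise recurse into nested structures.
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

console.log(JSON.stringify(redact({ user: "u_82", password: "hunter2" })));
```

Key-based masking only catches secrets stored under recognisable names. It does not detect a token embedded in a free-text message, so it complements the rule above rather than replacing it.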

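Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

Three Pillars

Pillar	Purpose	Question It Answers	Example
Logs	What happened	Why did this request fail?	{"level":"error","msg":"payment declined","user_id":"u_82"}
Metrics	How much / how fast	Is latency increasing?	http_request_duration_seconds{route="/api/orders"} 0.342
Traces	Request flow	Where is the bottleneck?	Span: api-gateway → auth → order-service → db

Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.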

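Structured Logging

Always emit logs as structured JSON, never as free-text strings.

Required Fields

Field	Purpose	Required
timestamp	ISO-8601 with milliseconds	Yes
level

Context Enrichment

Attach context at the middleware level so downstream logs inherit it automatically: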

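```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Request-scoped store; `app` is assumed to be an Express application.
const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

app.use((req, res, next) => {
  const ctx = {
    // Reuse the caller's trace ID when present, otherwise start a new one.
    trace_id: req.headers["x-trace-id"] || randomUUID(),
    request_id: randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  // Code reached from next() can read ctx via asyncLocalStorage.getStore().
  asyncLocalStorage.run(ctx, () => next());
});
```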

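Library Recommendations

Library	Language	Strengths	Perf
Pino	Node.js	Fastest Node logger, low overhead	Excellent
structlog	Python	Composable processors, context binding	Good
zerolog	Go	Zero-allocation JSON logging	Excellent
zap	Go	High performance, typed fields	Excellent
tracing	Rust	Spans + events, async-aware	Excellent

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.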

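Log Levels

Level	When to Use	Example
FATAL	App cannot continue, process will exit	Database connection pool exhausted
ERROR

Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.

Distributed Tracing

OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs: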

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "order-service",
  // Export spans over OTLP/HTTP to the collector.
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

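Span Creation

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// `Order`, `validateInventory`, and `chargePayment` are assumed to be defined elsewhere.
const tracer = trace.getTracer("order-service");

async function processOrder(order: Order) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", order.id);
      span.setAttribute("order.total_cents", order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // A caught value is not guaranteed to be an Error, so normalise it first.
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw err;
    } finally {
      // Always end the span, whether the work succeeded or failed.
      span.end();
    }
  });
}
```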

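Context Propagation

- Use W3C Trace Context (traceparent header), the default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise traceparent into the job payload

Trace Sampling

Strategy	Use When
Always On	Low-traffic services, debugging
Probabilistic (N%)

Always sample 100% of error traces regardless of strategy.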

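Metrics Collection

RED Method (Request-Driven)

Monitor these three for every service endpoint:

Metric	What It Measures	Prometheus Example
Rate	Requests/sec	rate(http_requests_total[5m])
Errors	Failed request ratio	sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Duration	Response time	histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

Metric	What It Measures	Example
Utilization	% resource busy	CPU usage at 78%
Saturation

Monitoring Stack

Tool	Category	Best For
Prometheus	Metrics	Pull-based metrics, alerting rules
Grafana

Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.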

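Alert Design

Severity Levels

Severity	Response Time	Example
P1	Immediate	Service fully down, data loss
P2

Alert Fatigue Prevention

- Alert on symptoms, not causes: "error rate > 5%", not "pod restarted"
- Multi-window, multi-burn-rate: catch both sudden spikes and slow burns
- Require runbook links: every alert must link to diagnosis and remediation
- Review monthly: delete or tune alerts that never fire or always fire
- Group related alerts: use inhibition rules to suppress child alerts
- Set appropriate thresholds: if an alert fires daily and is ignored, raise the threshold or delete it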

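Dashboard Patterns

Overview Dashboard ("War Room")

- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs

Service Dashboard (Per-Service)

- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table sorted by count and last seen

Observability Checklist

Every service must have:

- [ ] Structured JSON logging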
