Distributed Tracing (Deep Workflow)
Traces answer which hop consumed time and where errors surfaced across services. Success requires consistent propagation, meaningful spans, and sampling that preserves signal without bankrupting storage.
When to Offer This Workflow
Trigger conditions:
- Microservices with “unknown latency” between services A and B
- Adopting OpenTelemetry, Jaeger, Zipkin, X-Ray, Cloud Trace
- Need service map and dependency insights
- High cardinality or cost concerns from traces
Initial offer:
Use six stages: (1) define goals & SLOs, (2) instrumentation plan, (3) propagation & context, (4) sampling strategy, (5) analysis workflows, (6) governance & cost. Confirm languages and infra (K8s, service mesh).
Stage 1: Goals & SLOs
Goal: Know why tracing exists—latency, errors, dependency discovery, or customer journey mapping.
Questions
- Which routes have the worst p95/p99 latency?
- Compliance or PII constraints on span attributes?
- Cardinality tolerance—user IDs on every span?
Exit condition: Success metrics are agreed and written down, e.g., “reduce unknown time in checkout to <5% of trace duration.”
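A metric like "unknown time" needs a definition before it can be tracked. The sketch below assumes one possible definition: the share of the root span's duration that no direct child span covers. The names `covered_ms` and `unknown_fraction` are illustrative, and part of that uncovered time may be legitimate work in the root service itself.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms)


def covered_ms(intervals: List[Interval]) -> float:
    """Total length of the union of intervals; overlaps are counted once."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None:
            cur_start, cur_end = start, end
        elif start > cur_end:
            # Gap found: close the current merged interval and start a new one.
            total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def unknown_fraction(root: Interval, children: List[Interval]) -> float:
    """Share of the root span's duration not covered by any child span."""
    root_start, root_end = root
    duration = root_end - root_start
    if duration <= 0:
        return 0.0
    # Clip children to the root window so clock skew cannot push coverage past 100%.
    clipped = [(max(s, root_start), min(e, root_end)) for s, e in children]
    clipped = [(s, e) for s, e in clipped if e > s]
    return 1.0 - covered_ms(clipped) / duration
```

For a 100 ms root span whose children cover 0-70 ms and 80-95 ms, this reports roughly 15% unknown time, which would miss a 5% target.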
Stage 2: Instrumentation Plan
Goal: Add spans where they help, not around every function.
Layers
- HTTP server middleware: span per request, route name normalized
- HTTP clients: outgoing spans with peer service
- DB: client spans with statement type—not raw SQL text in prod by default
- Queues: produce/consume spans with message correlation
- Background jobs: separate spans with job type
Naming
- Keep span names stable (route patterns such as GET /orders/{id}) rather than high-cardinality raw paths
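Most web frameworks expose the matched route template, and that is the better source for span names. Where it is not available, a fallback heuristic can collapse ID-like path segments. This is a minimal sketch with a hypothetical `span_name` helper that treats only all-digit segments and UUIDs as IDs.

```python
import re

_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_NUMBER = re.compile(r"\d+")


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name such as 'GET /orders/{id}'."""
    # Drop the query string, then inspect each path segment.
    segments = path.split("?", 1)[0].strip("/").split("/")
    normalized = [
        "{id}" if _NUMBER.fullmatch(seg) or _UUID.fullmatch(seg) else seg
        for seg in segments
    ]
    return f"{method.upper()} /" + "/".join(normalized)
```

With this heuristic, `span_name("get", "/orders/12345?expand=items")` yields `GET /orders/{id}`. Slugs, hashes, and other ID formats would need their own rules.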
Attributes
- service.name, deployment.environment, http.status_code, db.system; follow the OTel semantic conventions
Exit condition: Inventory of frameworks auto-instrumented vs manual spans needed.
Stage 3: Propagation & Context
Goal: Trace ID crosses async boundaries—no broken traces.
Practices
- W3C Trace Context headers for HTTP; messaging propagators for Kafka/AMQP
- Async tasks: attach context when scheduling (executor, asyncio, Promise)
- Batch processing: link spans or baggage carefully; avoid leaking PII
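To make the HTTP case concrete, here is a sketch of reading and forwarding a W3C `traceparent` header (`version-traceid-parentid-flags`, lowercase hex). It is a simplification that accepts only the four-field layout; in practice a tracing library's propagator does this. `SpanContext`, `parse_traceparent`, and `child_traceparent` are illustrative names.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the incoming context, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

The point to verify in any service is that the trace ID on the outgoing call equals the one on the incoming request. A new trace ID at a hop is the signature of a broken trace.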
Service mesh
- Sidecar tracing vs library tracing: avoid double counting; configure one source of truth
Exit condition: Broken trace rate measurable; top 5 causes documented (missing propagation, etc.).
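Measuring a broken trace rate needs a working definition. The sketch below assumes a trace is broken when it has no single root span or contains a span whose parent is missing from the trace. The span dictionaries with `trace_id`, `span_id`, and `parent_id` keys are an assumed export format, not any particular backend's schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


def broken_trace_rate(spans: Iterable[dict]) -> float:
    """Fraction of traces with a missing parent or without exactly one root."""
    span_ids: Dict[str, Set[str]] = defaultdict(set)
    parent_ids: Dict[str, List[Optional[str]]] = defaultdict(list)
    for span in spans:
        span_ids[span["trace_id"]].add(span["span_id"])
        parent_ids[span["trace_id"]].append(span.get("parent_id"))
    if not span_ids:
        return 0.0
    broken = 0
    for trace_id, ids in span_ids.items():
        parents = parent_ids[trace_id]
        roots = sum(1 for p in parents if p is None)
        orphans = sum(1 for p in parents if p is not None and p not in ids)
        if roots != 1 or orphans > 0:
            broken += 1
    return broken / len(span_ids)
```

Sampling and late-arriving spans can also produce orphans, so the number is best read as a trend per service rather than an exact count.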
Stage 4: Sampling Strategy
Goal: Representative traces without storing everything.
Head-based
- Fixed percentage decided when the trace starts; cheap, but the decision is made before errors or latency are known, so tail sampling is often still needed to keep every error
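A head-based decision should be deterministic per trace ID so that every service reaches the same verdict without coordination. This sketch compares the low 64 bits of the trace ID against a threshold. It is similar in spirit to ratio-based samplers in tracing SDKs but does not reproduce any library's exact algorithm; `head_sample` is a hypothetical name.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, with the same answer for the same trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Map the ratio onto the 64-bit range covered by the low half of the trace ID.
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound
```

Because the input is only the trace ID, nothing about the request's outcome can influence the decision. That is why head-based sampling cannot guarantee that error traces are kept.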
Tail-based
- Retain interesting traces (high latency, errors) after they complete; more operational complexity, but better signal
Cost controls
- Attribute limits; span limits per trace; drop health checks
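These controls usually live in SDK or collector configuration. As a sketch of the logic only, the function below drops health-check spans and caps attribute count and value length. The route names and limits are placeholder assumptions, and `trim_span` is a hypothetical helper.

```python
from typing import Dict, Optional

HEALTH_ROUTES = {"/healthz", "/readyz", "/livez"}  # assumed health-check routes
MAX_ATTRIBUTES = 32      # assumed per-span attribute cap
MAX_VALUE_LENGTH = 256   # assumed per-value character cap


def trim_span(route: str, attributes: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return trimmed attributes, or None if the span should be dropped."""
    if route in HEALTH_ROUTES:
        return None
    trimmed = {}
    # Keep the first MAX_ATTRIBUTES keys in insertion order and truncate long values.
    for key in list(attributes)[:MAX_ATTRIBUTES]:
        trimmed[key] = attributes[key][:MAX_VALUE_LENGTH]
    return trimmed
```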
Exit condition: Written policy: baseline rate + error always + latency outliers.
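The written policy can be expressed as a tail-style decision over a completed trace. In this sketch `FinishedSpan`, `keep_trace`, and the threshold parameters are illustrative assumptions, and a CRC32 hash of the trace ID stands in for whatever deterministic baseline a real collector uses.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class FinishedSpan:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(
    spans: List[FinishedSpan], latency_threshold_ms: float, baseline_ratio: float
) -> bool:
    """Policy: always keep errors, keep latency outliers, sample the rest."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True
    # Deterministic baseline: hash the trace ID into [0, 1) and compare to the ratio.
    bucket = zlib.crc32(spans[0].trace_id.encode("ascii")) / 2**32
    return bucket < baseline_ratio
```

The order of the checks is what makes "error always" hold: the baseline ratio applies only to traces that are neither failed nor slow.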
Stage 5: Analysis Workflows
Goal: Engineers use traces in incidents and perf work.
Workflows
- Trace view: critical path, longest child span
- Compare releases: same route, different p99 span
- Service map from edges—validate unexpected dependencies
Anti-patterns
- Looking only at averages; traces are about specific slow requests
Exit condition: Runbook snippet: “How to find slowest span in checkout.”
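One way to back such a runbook entry is to rank spans by self time, meaning duration minus time spent in children. The sketch assumes children run sequentially; overlapping children make self time an underestimate. `TimedSpan` and the helper names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class TimedSpan(NamedTuple):
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float


def self_times(spans: List[TimedSpan]) -> Dict[str, float]:
    """Duration of each span minus the summed duration of its direct children."""
    child_total: Dict[str, float] = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.end_ms - span.start_ms
    return {
        span.span_id: max(0.0, (span.end_ms - span.start_ms) - child_total[span.span_id])
        for span in spans
    }


def slowest_by_self_time(spans: List[TimedSpan]) -> TimedSpan:
    """Span that accounts for the most time not explained by its children."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span.span_id])
```

Ranking by self time rather than raw duration avoids always pointing at the root span, which is the longest by construction.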
Stage 6: Governance & Cost
Goal: PII controlled; budget predictable.
Practices
- PII redaction processors; secrets never in attributes
- Retention policies per env; export to cheap storage for long-term if needed
- Ownership of semantic conventions in org
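A redaction processor combines a deny-list of attribute keys with pattern scrubbing of values. The sketch below shows both; the key names in `SENSITIVE_KEYS` and the email pattern are assumptions to adapt to the organization's own conventions, and pattern matching alone will not catch every form of PII.

```python
import re
from typing import Dict

# Hypothetical deny-list; the real list belongs to whoever owns the conventions.
SENSITIVE_KEYS = {"password", "user.email", "http.request.header.authorization"}
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(attributes: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the attributes with sensitive keys and emails masked."""
    cleaned = {}
    for key, value in attributes.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        else:
            # Scrub email addresses that leak into free-text values.
            cleaned[key] = _EMAIL.sub("[EMAIL]", value)
    return cleaned
```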
Final Review Checklist
- [ ] Instrumentation covers critical paths and async boundaries
- [ ] Propagation validated; broken trace rate monitored
- [ ] Sampling policy balances cost vs signal
- [ ] Semantic conventions applied consistently
- [ ] PII/secrets not in spans
Tips for Effective Guidance
- Prefer OpenTelemetry as the single API with vendor exporters; this avoids vendor lock-in at the instrumentation layer.
- DB spans: recommend query shape (normalized) not raw SQL in prod.
- Logs ↔ traces: inject trace_id in logs for correlation.
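For log correlation, a logging filter can stamp every record with the current trace ID. This sketch uses only the standard library; the `current_trace_id` context variable stands in for whatever the tracing SDK exposes, and `build_logger` is a hypothetical helper.

```python
import contextvars
import logging

# Stand-in for the tracing SDK's current-span lookup (an assumption of this sketch).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(name: str, stream=None) -> logging.Logger:
    """Logger whose lines carry trace_id=<id> for joining against traces."""
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With the trace ID in each line, an engineer can jump from a log entry to the matching trace and back during an incident.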
Handling Deviations
- Monolith: single-process traces are still valuable; async and thread hops can still break context propagation.
- High cardinality crisis: drop high-cardinality attributes first, then lower the sampling rate; never drop error visibility blindly.
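The thread-hop problem inside a monolith can be reproduced with the standard library alone. On typical CPython builds, a task submitted to a thread pool does not see the submitting thread's context variables unless the context is copied explicitly. `active_trace_id` and `submit_with_context` are illustrative names; tracing SDKs usually ship their own executor instrumentation.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional

# Stand-in for the tracing SDK's active context (an assumption of this sketch).
active_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "active_trace_id", default=None
)


def read_trace_id() -> Optional[str]:
    """What a background task sees as its trace ID."""
    return active_trace_id.get()


def submit_with_context(
    executor: ThreadPoolExecutor, fn: Callable[..., Any], *args: Any
) -> Future:
    """Submit fn so that it runs inside a copy of the caller's context."""
    ctx = contextvars.copy_context()  # snapshot taken at scheduling time
    return executor.submit(ctx.run, fn, *args)
```

Used as `submit_with_context(pool, read_trace_id)`, the task reports the caller's trace ID. A plain `pool.submit(read_trace_id)` typically returns `None`, which is how an in-process trace breaks.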