Distributed Tracing (Deep Workflow)

Traces answer which hop consumed time and where errors surfaced across services. Success requires consistent propagation, meaningful spans, and sampling that preserves signal without bankrupting storage.

When to Offer This Workflow

Trigger conditions:

- Unexplained latency between microservices (“unknown latency” between A and B)
- Adopting OpenTelemetry, Jaeger, Zipkin, X-Ray, or Cloud Trace
- Need for a service map and dependency insights
- High-cardinality or cost concerns from traces

Initial offer:

Use six stages: (1) define goals & SLOs, (2) instrumentation plan, (3) propagation & context, (4) sampling strategy, (5) analysis workflows, (6) governance & cost. Confirm languages and infra (K8s, service mesh).

Stage 1: Goals & SLOs

Goal: Know why tracing exists—latency, errors, dependency discovery, or customer journey mapping.

Questions

1. Which routes have the worst p95/p99 latency?
2. Are there compliance or PII constraints on span attributes?
3. What is the cardinality tolerance, for example user IDs on every span?

Exit condition: Success metrics are defined, e.g., “reduce unknown time in checkout to <5% of trace duration.”
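To make a target like the one above measurable, "unknown time" needs a definition. The sketch below assumes it means the part of the root span's duration that no child span covers; the function names and the sample intervals are illustrative, not taken from any tracing backend.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms)


def covered_ms(intervals: List[Interval]) -> float:
    """Total length of the union of intervals (overlaps counted once)."""
    total = 0.0
    current_end = float("-inf")
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # fully inside what is already counted
        total += end - max(start, current_end)
        current_end = end
    return total


def unknown_fraction(root: Interval, children: List[Interval]) -> float:
    """Share of the root span's duration not covered by any child span."""
    root_start, root_end = root
    duration = root_end - root_start
    if duration <= 0:
        return 0.0
    # Clip children to the root window before merging them.
    clipped = [
        (max(start, root_start), min(end, root_end))
        for start, end in children
        if end > root_start and start < root_end
    ]
    return 1.0 - covered_ms(clipped) / duration


# Made-up checkout trace: 200 ms root, three child spans, two of them overlapping.
checkout_root = (0.0, 200.0)
checkout_children = [(10.0, 60.0), (50.0, 120.0), (130.0, 190.0)]
print(f"{unknown_fraction(checkout_root, checkout_children):.0%}")  # prints 15%
```

Tracking this fraction per route over time gives a number that instrumentation work can move.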

Stage 2: Instrumentation Plan

Goal: Add spans where they help, not on every function.

Layers

- HTTP server middleware: span per request, route name normalized
- HTTP clients: outgoing spans with peer service
- DB: client spans with statement type; no raw SQL text in prod by default
- Queues: produce/consume spans with message correlation
- Background jobs: separate spans with job type
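For the DB layer, one way to record a statement type and a query shape without raw SQL is to strip literals before the text reaches a span attribute. The following is a rough regex-based sketch, not a SQL parser; `statement_type` and `query_shape` are hypothetical helper names, and real instrumentation libraries may already offer their own sanitization.

```python
import re

_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")
_WHITESPACE = re.compile(r"\s+")


def statement_type(sql: str) -> str:
    """Return the leading keyword (SELECT, INSERT, ...) in upper case."""
    stripped = sql.strip()
    return stripped.split(None, 1)[0].upper() if stripped else "UNKNOWN"


def query_shape(sql: str) -> str:
    """Replace string and numeric literals with '?' and collapse whitespace."""
    shaped = _STRING_LITERAL.sub("?", sql)       # strings first, so digits inside them vanish too
    shaped = _NUMBER_LITERAL.sub("?", shaped)    # standalone numbers only; 'table1' is untouched
    return _WHITESPACE.sub(" ", shaped).strip()


raw = "select * from orders  where id = 42 and email = 'a@b.com'"
print(statement_type(raw))  # SELECT
print(query_shape(raw))     # select * from orders where id = ? and email = ?
```

The shape stays low-cardinality and carries no user data, which also helps with the PII concerns raised in Stage 1.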

Naming

- Keep span names stable (GET /orders/{id} patterns) rather than using high-cardinality raw paths
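When the web framework exposes the matched route template, that template is usually the better span name. Where it does not, a fallback normalizer along these lines can keep names stable. This is a sketch under that assumption; `normalize_path` and `span_name` are hypothetical helpers, and the placeholder names are arbitrary.

```python
import re

_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def normalize_path(path: str) -> str:
    """Replace numeric and UUID path segments with placeholders; drop the query string."""
    segments = []
    for segment in path.split("?", 1)[0].split("/"):
        if segment.isdigit():
            segments.append("{id}")
        elif _UUID.fullmatch(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/".join(segments)


def span_name(method: str, path: str) -> str:
    """Build a low-cardinality span name such as 'GET /orders/{id}'."""
    return f"{method.upper()} {normalize_path(path)}"


print(span_name("get", "/orders/12345?expand=items"))  # GET /orders/{id}
```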

Attributes

- service.name, deployment.environment, http.status_code, db.system—follow semantic conventions (OTel)

Exit condition: Inventory of frameworks auto-instrumented vs manual spans needed.

Stage 3: Propagation & Context

Goal: Trace ID crosses async boundaries—no broken traces.

Practices

- W3C Trace Context headers for HTTP; messaging propagators for Kafka/AMQP
- Async tasks: attach context when scheduling (executor, asyncio, Promise)
- Batch processing: link spans or use baggage carefully; avoid leaking PII
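To show what propagation actually carries over HTTP, here is a minimal sketch of reading and writing the W3C `traceparent` header (`version-traceid-parentid-flags`). It only accepts version `00` and ignores `tracestate`; in practice an OpenTelemetry propagator would do this work, and the helper names below are hypothetical.

```python
import re
import secrets
from typing import NamedTuple, Optional


class SpanContext(NamedTuple):
    trace_id: str   # 32 lowercase hex characters
    span_id: str    # 16 lowercase hex characters
    sampled: bool   # lowest bit of the trace-flags field


_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
if incoming is not None:
    print(child_traceparent(incoming))
```

A trace breaks whenever a hop drops this header or starts a new trace ID instead of reusing the incoming one.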

Service mesh

- Sidecar tracing vs library tracing—avoid double counting; configure one source of truth

Exit condition: Broken trace rate measurable; top 5 causes documented (missing propagation, etc.).
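One possible way to make the broken trace rate measurable is to count traces that contain an orphan span, meaning a span whose parent ID never shows up in the same trace. That definition is an assumption of this sketch (late exports and sampling can also produce orphans), and the record fields are hypothetical rather than a backend schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, TypedDict


class SpanRecord(TypedDict):
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for a root span


def broken_trace_rate(spans: Iterable[SpanRecord]) -> float:
    """Fraction of traces containing at least one span with a missing parent."""
    by_trace: Dict[str, List[SpanRecord]] = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)
    if not by_trace:
        return 0.0
    broken = 0
    for trace_spans in by_trace.values():
        known_ids = {s["span_id"] for s in trace_spans}
        has_orphan = any(
            s["parent_id"] is not None and s["parent_id"] not in known_ids
            for s in trace_spans
        )
        if has_orphan:
            broken += 1
    return broken / len(by_trace)


sample = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a"},
    {"trace_id": "t2", "span_id": "c", "parent_id": None},
    {"trace_id": "t2", "span_id": "d", "parent_id": "missing"},
]
print(broken_trace_rate(sample))  # 0.5
```

Grouping the orphan spans by service name is a quick route to the "top 5 causes" list.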

Stage 4: Sampling Strategy

Goal: Representative traces without storing everything.

Head-based

- Fixed percentage decided when the trace starts; the outcome is not yet known at that point, so keeping every error trace usually still requires tail sampling
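A head-based decision is typically made deterministically from the trace ID, so every service that sees the same ID reaches the same verdict. The sketch below compares the low 64 bits of the ID against a threshold; it is similar in spirit to trace-ID ratio samplers but is not a copy of any SDK's implementation, and `head_sample` is a hypothetical name.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < int(ratio * (1 << 64))


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(head_sample(trace_id, 0.5))  # False: the low 64 bits sit at roughly 64% of the range
print(head_sample(trace_id, 0.7))  # True
```

Because the decision uses nothing but the ID, it cannot prefer slow or failed requests; that is the gap tail sampling fills.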

Tail-based

- Retain interesting traces (high latency, errors) once they have completed; more operational complexity but better signal
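A tail-based decision runs after all spans of a trace have been collected, which is normally done in a collector that buffers spans (the OpenTelemetry Collector has a tail sampling processor for this). The function below only sketches the decision rule: keep errors, keep latency outliers, keep a baseline share of the rest. The threshold, the ratio, and the record fields are assumptions for illustration.

```python
from typing import List, TypedDict


class FinishedSpan(TypedDict):
    duration_ms: float
    is_error: bool


def keep_trace(
    spans: List[FinishedSpan],
    trace_id_hex: str,
    latency_threshold_ms: float = 1000.0,
    baseline_ratio: float = 0.05,
) -> bool:
    """Decide whether to retain a completed trace."""
    if any(span["is_error"] for span in spans):
        return True  # errors are always kept
    if any(span["duration_ms"] >= latency_threshold_ms for span in spans):
        return True  # latency outliers are always kept
    # Everything else: deterministic baseline sample keyed on the trace ID.
    return int(trace_id_hex[-16:], 16) < int(baseline_ratio * (1 << 64))


failed = [{"duration_ms": 40.0, "is_error": True}]
print(keep_trace(failed, "4bf92f3577b34da6a3ce929d0e0e4736"))  # True
```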

Cost controls

- Attribute limits; span limits per trace; drop health checks
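The cost controls above can be illustrated with a small filter that drops health-check spans and caps attribute count and value length. The route names and limits are hypothetical placeholders; SDKs and collectors generally expose their own settings for such limits, which would be the first place to look.

```python
from typing import Dict, Optional

HEALTH_CHECK_ROUTES = frozenset({"/healthz", "/readyz", "/livez"})  # hypothetical routes
MAX_ATTRIBUTES = 32       # hypothetical per-span attribute cap
MAX_VALUE_LENGTH = 256    # hypothetical per-value length cap


def trim_span(route: str, attributes: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return trimmed attributes, or None if the span should be dropped entirely."""
    if route in HEALTH_CHECK_ROUTES:
        return None
    trimmed: Dict[str, str] = {}
    # Keep the first MAX_ATTRIBUTES entries (dicts preserve insertion order).
    for key, value in list(attributes.items())[:MAX_ATTRIBUTES]:
        trimmed[key] = value[:MAX_VALUE_LENGTH]
    return trimmed


print(trim_span("/healthz", {"http.route": "/healthz"}))  # None
```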

Exit condition: Written policy: baseline rate + error always + latency outliers.

Stage 5: Analysis Workflows

Goal: Engineers use traces in incidents and perf work.

Workflows

- Trace view: critical path, longest child span
- Compare releases: same route, different p99 span
- Service map from edges: validate unexpected dependencies

Anti-patterns

- Only looking at averages; traces are about specific slow requests

Exit condition: Runbook snippet: “How to find slowest span in checkout.”
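A runbook entry like this can be backed by a small script over exported spans. The sketch below ranks spans by self time (own duration minus the summed duration of direct children), assuming children run one after another; with parallel children the subtraction overshoots, so the result is clamped at zero. Field names and the sample trace are made up.

```python
from typing import Dict, List, Optional, TypedDict


class TimedSpan(TypedDict):
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def slowest_by_self_time(spans: List[TimedSpan]) -> TimedSpan:
    """Return the span with the largest self time; expects a non-empty list."""
    child_total: Dict[str, float] = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0.0) + span["duration_ms"]

    def self_time(span: TimedSpan) -> float:
        return max(span["duration_ms"] - child_total.get(span["span_id"], 0.0), 0.0)

    return max(spans, key=self_time)


checkout_trace: List[TimedSpan] = [
    {"span_id": "1", "parent_id": None, "name": "POST /checkout", "duration_ms": 500.0},
    {"span_id": "2", "parent_id": "1", "name": "payment.authorize", "duration_ms": 300.0},
    {"span_id": "3", "parent_id": "2", "name": "SELECT orders", "duration_ms": 250.0},
    {"span_id": "4", "parent_id": "1", "name": "inventory.reserve", "duration_ms": 120.0},
]
print(slowest_by_self_time(checkout_trace)["name"])  # SELECT orders
```

Ranking by self time rather than total duration avoids always pointing at the root span.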

Stage 6: Governance & Cost

Goal: PII controlled; budget predictable.

Practices

- PII redaction processors; secrets never in attributes
- Retention policies per env; export to cheap storage for long-term if needed
- Ownership of semantic conventions in org
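As an illustration of what a redaction step does before spans are exported, the sketch below masks values for a denylist of attribute keys and scrubs email-like strings elsewhere. The key names and the email pattern are assumptions for this example; a production setup would more likely rely on a collector-side processor with a list agreed by whoever owns the conventions.

```python
import re
from typing import Dict

# Hypothetical denylist; compare keys in lower case.
SENSITIVE_KEYS = frozenset({"user.email", "http.request.header.authorization", "password"})
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_attributes(attributes: Dict[str, str]) -> Dict[str, str]:
    """Return a copy with sensitive keys masked and email-like values scrubbed."""
    redacted: Dict[str, str] = {}
    for key, value in attributes.items():
        if key.lower() in SENSITIVE_KEYS:
            redacted[key] = "[REDACTED]"
        else:
            redacted[key] = _EMAIL.sub("[EMAIL]", value)
    return redacted


print(redact_attributes({"user.email": "a@b.com", "note": "contact jane@example.com now"}))
```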

Final Review Checklist

- [ ] Instrumentation covers critical paths and async boundaries
- [ ] Propagation validated; broken trace rate monitored
- [ ] Sampling policy balances cost vs signal
- [ ] Semantic conventions applied consistently
- [ ] PII/secrets not in spans

Tips for Effective Guidance

- Prefer OpenTelemetry as the single API with vendor exporters; this avoids vendor lock-in at the instrumentation layer.
- DB spans: recommend query shape (normalized), not raw SQL, in prod.
- Logs ↔ traces: inject trace_id in logs for correlation.
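Log correlation can be sketched with the standard `logging` module: a filter stamps each record with the current trace ID so log lines can be joined to traces. Here the ID lives in a hypothetical `ContextVar`; with OpenTelemetry it would instead be read from the active span context, and logging instrumentation may do this automatically.

```python
import contextvars
import logging
import sys

# Hypothetical holder for the active trace ID; "-" means no trace is active.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger whose lines carry trace_id=<id>."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


log = build_logger("checkout", sys.stdout)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("payment authorized")
```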

Handling Deviations

- Monolith: single-process traces are still valuable; async and thread hops can still break context.
- High cardinality crisis: drop high-cardinality attributes first, then reduce the sampling rate; never drop error visibility blindly.
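The thread-hop problem in a single process can be reproduced with the standard library alone. In the sketch below a worker thread typically does not see the caller's context variables unless the context is copied at submit time; `current_trace_id` and `submit_with_context` are hypothetical names, and tracing SDKs usually ship their own helpers for this.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional

current_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "current_trace_id", default=None
)


def read_trace_id() -> Optional[str]:
    """Return the trace ID visible in the calling thread's context."""
    return current_trace_id.get()


def submit_with_context(
    executor: ThreadPoolExecutor, fn: Callable[..., Any], *args: Any
) -> Future:
    """Submit fn so that it runs inside a copy of the caller's context."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
with ThreadPoolExecutor(max_workers=1) as pool:
    plain = pool.submit(read_trace_id).result()              # usually None: context not carried over
    carried = submit_with_context(pool, read_trace_id).result()
print(plain, carried)
```

The same pattern applies to any scheduler that runs work on another thread: capture the context where the task is created, not where it executes.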