Observability & SLOs (Deep Workflow)
SLOs connect engineering work to user-perceived reliability. SLIs must be measurable from systems but grounded in user journeys.
When to Offer This Workflow
Trigger conditions:
- A target such as 99.9% has been set without saying what it measures
- Too many pages, or none at all; the team needs error budget discipline
- Product wants features while stability degrades
Initial offer:
Use six stages: (1) pick user journeys, (2) define SLIs, (3) set SLO targets & windows, (4) error budget policy, (5) alerting on budget burn, (6) review & iterate. Confirm the metrics stack and any dependency SLOs from vendors.
Stage 1: User Journeys
Goal: Identify the critical paths whose failure users actually notice (checkout, login, API sync), not resource metrics such as “CPU low”.
Output
3–10 journeys ranked by business impact and frequency.
Exit condition: One paragraph per journey: user intent + failure symptom.
Stage 2: Define SLIs
Goal: Express each SLI as the ratio of good events to total events over a window, with the implementation stated explicitly.
Examples
- Availability: successful requests / valid requests (define “valid”)
- Latency: proportion of requests faster than T ms
Good SLIs
- Objective, and low enough in cardinality to measure reliably
Exit condition: SLI formula + data source (metrics, logs, probes).
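The two example SLIs above can be sketched as plain ratio functions. This is a minimal illustration, not a reference implementation: the `Request` record, the `is_probe` flag, and the rule that probes and 4xx responses are not "valid" are all assumptions made here to show where the definition of "valid" has to be written down.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    status: int              # HTTP status code
    latency_ms: float        # measured latency in milliseconds
    is_probe: bool = False   # health checks and synthetic traffic


def is_valid(r: Request) -> bool:
    # Assumption: probes and client errors (4xx) are excluded from "valid".
    return not r.is_probe and not (400 <= r.status < 500)


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Successful valid requests / valid requests; None without valid traffic."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)


def latency_sli(requests: Iterable[Request], threshold_ms: float) -> Optional[float]:
    """Proportion of valid requests faster than threshold_ms."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    fast = sum(1 for r in valid if r.latency_ms < threshold_ms)
    return fast / len(valid)


sample = [
    Request(200, 120), Request(200, 480), Request(503, 30), Request(200, 90),
    Request(404, 10),                 # client error: not valid
    Request(200, 5, is_probe=True),   # probe: not valid
]
print(availability_sli(sample))      # 0.75
print(latency_sli(sample, 300))      # 0.75
```

Returning `None` for an empty window keeps "no traffic" distinct from "0% good", which matters when the result feeds an alert. Note that this latency sketch counts a fast 5xx as fast; some teams restrict the latency SLI to successful requests.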
Stage 3: SLO Targets & Windows
Goal: A target (e.g., 99.9% monthly) implies a number of allowed bad minutes; state that number explicitly.
Practices
- A rolling 30-day window is common; align it with the release cadence
- Tier services: not everything needs the same SLO
Exit condition: Published table: journey → SLI → target → window.
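To make the allowed bad minutes concrete, the arithmetic can be sketched as follows. The function name `allowed_bad_minutes` is illustrative, and the calculation treats "bad" as full unavailability for the whole minute, which is a simplification.

```python
def allowed_bad_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability that a target allows over the window."""
    if not 0 < target < 1:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    return (1 - target) * window_days * 24 * 60


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} over 30d -> {allowed_bad_minutes(target):.2f} bad minutes")
# 99.00% over 30d -> 432.00 bad minutes
# 99.90% over 30d -> 43.20 bad minutes
# 99.99% over 30d -> 4.32 bad minutes
```

Each extra nine cuts the budget by a factor of ten, which is a useful prompt when deciding which tier a service belongs in.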
Stage 4: Error Budget Policy
Goal: Decide in advance what the team does when the budget is healthy versus exhausted.
Policy ideas
- Budget healthy → ship features; low → freeze risky changes, focus on reliability
- Escalation when budget burns fast (multi-window alerts)
Exit condition: Written policy with product sign-off.
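A written policy is easier to agree on when the budget calculation behind it is unambiguous. The sketch below is one possible shape, under stated assumptions: the 25% "low" threshold and the action strings are hypothetical placeholders for whatever product and engineering sign off on.

```python
def budget_remaining(target: float, good: int, total: int) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    if not 0 < target < 1:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    if total == 0:
        return 1.0  # no events in the window, so nothing has been spent
    allowed_bad = (1 - target) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def policy_action(remaining: float, low_threshold: float = 0.25) -> str:
    # Assumption: three states; the threshold and wording are illustrative.
    if remaining <= 0:
        return "exhausted: reliability work only"
    if remaining < low_threshold:
        return "low: freeze risky changes"
    return "healthy: ship features"


remaining = budget_remaining(target=0.999, good=999_100, total=1_000_000)
print(policy_action(remaining))  # low: freeze risky changes
```

In the usage line, 900 bad events against roughly 1,000 allowed leaves about 10% of the budget, which falls under the assumed 25% threshold.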
Stage 5: Alerting on Burn
Goal: Page on budget burn rate, not on every blip. With Google-style SLO alerting, use the multi-window, multi-burn-rate pattern.
Practices
- Fast burn = page soon; slow burn = ticket/track
Exit condition: Alert rules linked to runbooks.
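The multi-window, multi-burn-rate idea can be sketched in a few lines. Burn rate here is the observed error ratio divided by the budgeted error ratio (1 minus the target). The window pairs and thresholds below are the values commonly cited from the Google SRE Workbook for a 30-day window; treat them as a starting assumption to tune, and note that `evaluate` and the window labels are names invented for this example.

```python
from typing import Dict, Optional


def burn_rate(error_ratio: float, target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    return error_ratio / (1 - target)


# (long window, short window, burn-rate threshold, action), most severe first.
RULES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def evaluate(error_ratios: Dict[str, float], target: float) -> Optional[str]:
    """error_ratios maps a window label to the bad/total ratio seen in that window."""
    for long_w, short_w, threshold, action in RULES:
        long_burn = burn_rate(error_ratios[long_w], target)
        short_burn = burn_rate(error_ratios[short_w], target)
        # Both windows must exceed the threshold: the long window shows the
        # burn is significant, the short window shows it is still happening.
        if long_burn >= threshold and short_burn >= threshold:
            return action
    return None


fast = {"5m": 0.03, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0004}
slow = {"5m": 0.002, "30m": 0.002, "1h": 0.002, "6h": 0.002, "3d": 0.0015}
blip = {"5m": 0.05, "30m": 0.01, "1h": 0.001, "6h": 0.0005, "3d": 0.0002}
for name, ratios in (("fast", fast), ("slow", slow), ("blip", blip)):
    print(name, evaluate(ratios, target=0.999))
# fast page
# slow ticket
# blip None
```

The `blip` case shows why the pattern cuts noise: a 5% error spike over five minutes does not page because the one-hour window has not burned enough budget.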
Stage 6: Review & Iterate
Goal: SLOs drift as the architecture changes. Review them quarterly and adjust targets based on data.
Final Review Checklist
- [ ] Journeys and SLIs tied to real user pain
- [ ] Targets realistic vs dependencies and cost
- [ ] Error budget policy agreed with product
- [ ] Alerts on burn, not noisy symptom spam
- [ ] Review cadence scheduled
Tips for Effective Guidance
- Translate 99.9% into minutes per month of allowed badness (roughly 43 minutes over 30 days).
- An SLA is a contract; an SLO is an internal target. Don’t confuse them.
- Dependency SLO caps what you can promise—surface that early.
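The dependency cap in the last tip can be estimated with a simple product. This sketch assumes every dependency is a hard, serial dependency and that failures are independent; retries, caching, or graceful degradation would raise the ceiling, and correlated failures would change the arithmetic. The function name `availability_ceiling` is illustrative.

```python
from math import prod
from typing import Sequence


def availability_ceiling(own: float, dependency_targets: Sequence[float]) -> float:
    """Best-case availability when every listed dependency must be up."""
    return own * prod(dependency_targets)


# Hypothetical service at 99.95% with two vendor dependencies.
ceiling = availability_ceiling(0.9995, [0.999, 0.9995])
print(f"{ceiling:.4%}")  # 99.8001%
```

Under these assumptions the service cannot credibly promise 99.9%, even though no single component is below 99.9%.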
Handling Deviations
- No metrics yet: start with a proxy SLI (synthetic probes) and improve instrumentation over time.
- Batch systems: use event processing lag as the SLI instead of HTTP request metrics.
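For the batch case, a lag-based SLI keeps the same good-over-total shape: an event is good if it was processed within an agreed lag. The sketch below assumes each event carries a creation and a processing timestamp, and it counts unprocessed events as bad; the `lag_sli` name and the five-minute limit are illustrative.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# (created_at, processed_at), where processed_at is None if not yet processed.
Event = Tuple[datetime, Optional[datetime]]


def lag_sli(events: Iterable[Event], max_lag: timedelta) -> Optional[float]:
    """Proportion of events processed within max_lag; None if there are no events."""
    total = 0
    good = 0
    for created_at, processed_at in events:
        total += 1
        # Assumption: an event with no processing timestamp counts as bad.
        if processed_at is not None and processed_at - created_at <= max_lag:
            good += 1
    return good / total if total else None


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    (t0, t0 + timedelta(minutes=2)),
    (t0, t0 + timedelta(minutes=4)),
    (t0, t0 + timedelta(minutes=12)),  # too slow
    (t0, None),                        # never processed
]
print(lag_sli(events, timedelta(minutes=5)))  # 0.5
```

A production version would likely only judge events older than `max_lag`, so that events still inside their allowed lag are not counted as bad.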