SRE Practices (Deep Workflow)
SRE is not “ops with a fancy title”—it is engineering reliability with explicit trade-offs between velocity and stability, measured with SLOs and managed through error budgets and toil budgets.
When to Offer This Workflow
Trigger conditions:
- Defining or revisiting SLOs; too many pages or too few alerts
- “We need five nines” without user-visible meaning
- High toil: manual deploys, ticket-driven scaling, runbooks that never shrink
- Post-incident push for “more reliability” without cost discussion
Initial offer:
Walk through six stages: (1) user journeys & SLIs, (2) SLO targets & windows, (3) error budgets & policy, (4) alerting & on-call, (5) toil & automation, (6) continuous improvement. Confirm service tiering and business criticality.
Stage 1: User Journeys & SLIs
Goal: Measure what users actually experience, not only server uptime.
Activities
- List critical journeys: signup, pay, search, API sync, etc.
- For each, pick SLI types: availability, latency, freshness, correctness (where measurable)
- Define the SLI implementation: e.g., the ratio of successful HTTP 2xx responses at the LB to all requests excluding health checks, versus deeper synthetic probes
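As a rough illustration of the request-ratio SLI above, the sketch below divides good requests by eligible requests. The `Request` record, the `availability_sli` function, and the `/healthz` path are hypothetical names chosen for this example; a real implementation would read from load balancer metrics or logs instead of an in-memory list.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    path: str    # request path as seen at the load balancer (assumed field)
    status: int  # HTTP status code returned to the client


def availability_sli(
    requests: Sequence[Request],
    health_paths: Sequence[str] = ("/healthz",),  # hypothetical health-check path
) -> Optional[float]:
    """Fraction of non-health-check requests that returned HTTP 2xx."""
    eligible = [r for r in requests if r.path not in health_paths]
    if not eligible:
        return None  # no user traffic in the window: the SLI is undefined
    good = sum(1 for r in eligible if 200 <= r.status < 300)
    return good / len(eligible)


sample = [
    Request("/pay", 200),
    Request("/pay", 500),
    Request("/healthz", 200),  # excluded from the denominator
    Request("/search", 204),
    Request("/signup", 201),
]
print(availability_sli(sample))  # 3 good out of 4 eligible requests
```

Returning `None` for an empty window is one possible design choice; it avoids reporting a misleading 100% or 0% when there was no user traffic to measure.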
Good SLIs
- Specific, measurable, and aligned with user pain; avoid vanity metrics
Exit condition: SLI definitions documented with data sources (metrics, logs, probes).
Stage 2: SLO Targets & Windows
Goal: Set achievable targets with explicit consequences.
Process
- Choose a window: a rolling 30 days is common; align it with release cadence
- Set the target (e.g., 99.9% availability) from error budget math: 99.9% over 30 days allows about 43 minutes of downtime
- Tier services: not everything needs 99.99%
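The error budget math behind a target can be sketched in a few lines. The function name `allowed_downtime_minutes` and the 30-day default are assumptions for this example, and it treats the budget purely as downtime, which is a simplification when the SLI is request-based.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted by an availability target over a window."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days -> {minutes:.2f} minutes")
```

Each extra nine cuts the budget by a factor of ten, which is why tiering services matters: a 99.99% target leaves only a few minutes per month for every failure and every risky deploy combined.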
Realism
- Account for dependencies you don’t control (public cloud, third-party APIs): your SLO cannot realistically exceed a hard dependency's SLO unless the architecture isolates its failures (for example with caching, fallbacks, or redundancy).
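To make the dependency ceiling concrete, here is a minimal sketch that multiplies availabilities along a chain of hard dependencies. It assumes every dependency is required on each request and that failures are independent, which is rarely exactly true; the function name `availability_ceiling` and the sample numbers are illustrative only.

```python
from math import prod
from typing import Sequence


def availability_ceiling(own_availability: float,
                         dependency_slos: Sequence[float]) -> float:
    """Rough upper bound on end-to-end availability with serial hard dependencies.

    Assumes independent failures and no isolation (no cache, fallback, or retry
    to a redundant provider), so every dependency outage is a service outage.
    """
    return own_availability * prod(dependency_slos)


# A service that is 99.95% available on its own, behind a 99.99% cloud load
# balancer and a 99.9% third-party API (hypothetical figures).
ceiling = availability_ceiling(0.9995, [0.9999, 0.999])
print(f"achievable ceiling: {ceiling:.4%}")
```

Under these assumptions the ceiling falls below the weakest dependency, so promising 99.9% for this service would require isolating the third-party API rather than tuning the service itself.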
Exit condition: Published SLO document per service or journey with measurement method.
Stage 3: Error Budget Policy
Goal: Decide how to spend budget—feature velocity vs reliability work.
Policy Examples
- Budget healthy → ship aggressively; budget low → freeze risky changes, focus on reliability
- Exceptions process: who can override, with what review
Communication
- Product and engineering share ownership of the budget, rather than “SRE says no” in the dark
Exit condition: Written policy stating what happens when 25%, 50%, and 100% of the budget has been consumed.
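A written policy of this kind can be encoded as a small lookup so that dashboards and release tooling agree on the current state. The sketch below is one possible shape: the function names, the action labels, and the meaning attached to each threshold are assumptions for illustration, not a prescribed policy.

```python
def budget_consumed(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget used so far in the window (1.0 = fully spent)."""
    if total_events == 0:
        return 0.0  # no traffic, nothing consumed
    allowed_bad = (1.0 - slo_target) * total_events
    bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if bad == 0 else float("inf")  # a 100% target has no budget
    return bad / allowed_bad


def policy_action(consumed: float) -> str:
    """Map budget consumption to a hypothetical policy step."""
    if consumed >= 1.0:
        return "freeze"   # stop risky changes, reliability work only
    if consumed >= 0.5:
        return "review"   # risky changes need extra review
    if consumed >= 0.25:
        return "notify"   # tell product and engineering owners
    return "normal"       # ship as usual


# 600 failed requests out of 1,000,000 against a 99.9% target.
used = budget_consumed(0.999, 999_400, 1_000_000)
print(f"{used:.0%} of budget used -> {policy_action(used)}")
```

Because the thresholds are compared with floating-point values, results that land exactly on a boundary may fall on either side; a production version might compare integer event counts instead.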
Stage 4: Alerting & On-Call
Goal: Pages are symptom-based, actionable, and low-noise.
Principles
- Alert on user pain or imminent SLO threat, not every blip
- Severity maps to response: e.g., SEV1 for a customer-wide outage versus a non-paging warning
- Runbooks linked; ownership clear
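One common way to alert on an imminent SLO threat rather than on every blip is a burn-rate check over two windows. The sketch below is a simplified version of that idea: the function names are hypothetical, and the 14.4 threshold (roughly 2% of a 30-day budget spent in one hour) is a commonly cited starting point that should be tuned per service.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is burning: 1.0 means exactly on pace to spend it all."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to have an error budget")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn fast.

    The long window (e.g., 1 hour) shows the problem is significant; the short
    window (e.g., 5 minutes) shows it is still happening, which cuts stale pages.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# 2% errors over the last hour and 3% over the last five minutes, 99.9% SLO.
print(should_page(0.02, 0.03, 0.999))    # sustained fast burn
# The hour looks bad, but the last five minutes have recovered.
print(should_page(0.02, 0.0005, 0.999))  # no page, the incident is over
```

Requiring both windows trades a little detection delay for far fewer pages about problems that have already resolved themselves.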
On-Call Health
- Limit pages per engineer per week; track toil hours
- Post-incident follow-through to reduce repeat pages
Exit condition: Alert inventory reviewed; tuning backlog for noisy alerts.
Stage 5: Toil & Automation
Goal: Reduce manual, repetitive, automatable work with measurable toil budgets.
Identify Toil
- Frequent tickets, manual scaling, click-ops deploys, data fixes without guardrails
Remediate
- Eliminate > automate > document, in that order of preference when safe
- Self-service platforms with guardrails beat hero scripts
Exit condition: Toil reduction roadmap with owners; ideally each team aims to keep toil under 50% of its time (a Google SRE guideline; adapt to your org).
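Tracking a toil budget only needs a ratio and a cap. The following sketch assumes teams log toil hours and total working hours per period; the function names and the sample figures are hypothetical, and the 50% default simply mirrors the guideline mentioned above.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of a team's (or engineer's) time spent on toil in a period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_cap(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when toil exceeds the agreed cap and should trigger roadmap work."""
    return toil_fraction(toil_hours, total_hours) > cap


# Hypothetical weekly figures per team: (toil hours, total hours).
weekly = {"payments": (24, 40), "search": (12, 40)}
for team, (toil, total) in weekly.items():
    flag = "over cap" if over_toil_cap(toil, total) else "within cap"
    print(f"{team}: {toil_fraction(toil, total):.0%} toil, {flag}")
```

Reviewing this number per quarter alongside the roadmap shows whether automation work is actually shrinking toil or only moving it around.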
Stage 6: Continuous Improvement
Goal: Reliability work is prioritized like features.
Loops
- Incident → action items with tracking
- Game days / failure injection where mature
- Quarterly SLO review—targets drift with product changes
Final Review Checklist
- [ ] SLIs tied to user-visible outcomes
- [ ] SLO targets realistic vs dependencies
- [ ] Error budget policy agreed with product
- [ ] Alerts actionable; noise tracked
- [ ] Toil identified with automation path
Tips for Effective Guidance
- Translate 99.9% into minutes of downtime per month (about 43 minutes over 30 days); this makes trade-offs concrete.
- Never promise zero incidents; promise learning and measurable improvement.
- Separate SLI (measurement) from SLO (target) from SLA (contract)—terms get confused.
Handling Deviations
- Early startup: start with basic monitoring and incident reviews before a full SLO program.
- No SRE role: the practices still apply; relabel them as “production excellence” if needed.