SRE Practices (Deep Workflow)

SRE is not “ops with a fancy title”—it is engineering reliability with explicit trade-offs between velocity and stability, measured with SLOs and managed through error budgets and toil budgets.

When to Offer This Workflow

Trigger conditions:

- Defining or revisiting SLOs; too many pages or too few alerts
- “We need five nines” without user-visible meaning
- High toil: manual deploys, ticket-driven scaling, runbooks that never shrink
- Post-incident push for “more reliability” without cost discussion

Initial offer:

Walk through six stages: (1) user journeys & SLIs, (2) SLO targets & windows, (3) error budgets & policy, (4) alerting & on-call, (5) toil & automation, (6) continuous improvement. Confirm service tiering and business criticality.

Stage 1: User Journeys & SLIs

Goal: Measure what users actually experience, not only server uptime.

Activities

- List critical journeys: signup, pay, search, API sync, etc.
- For each, pick SLI types: availability, latency, freshness, correctness (where measurable)
- Define SLI implementation: e.g., “successful HTTP 2xx from LB / all requests excluding health checks” vs deeper synthetic probes
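As a rough illustration of the request-based availability SLI described above, the sketch below computes "successful 2xx / all requests excluding health checks" from a list of request records. The `Request` type and the `/healthz` path are assumptions made for the example, not part of any particular load balancer's log format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; real data would come from LB logs or metrics."""
    path: str
    status: int


# Assumed health-check endpoint; adjust to whatever the service actually exposes.
HEALTH_CHECK_PATHS = {"/healthz"}


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    """Return good/eligible requests, or None when there is no eligible traffic."""
    eligible = [r for r in requests if r.path not in HEALTH_CHECK_PATHS]
    if not eligible:
        return None
    # "Good" follows the definition in the text: HTTP 2xx only.
    good = sum(1 for r in eligible if 200 <= r.status < 300)
    return good / len(eligible)


sample = [
    Request("/pay", 200),
    Request("/pay", 500),
    Request("/search", 204),
    Request("/healthz", 200),  # excluded from the denominator
    Request("/signup", 201),
]
print(availability_sli(sample))  # 3 good out of 4 eligible requests
```

Returning `None` for an empty window keeps "no traffic" distinct from "0% available", which matters when this number feeds an alert.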

Good SLIs

- Specific, measurable, aligned with pain—avoid vanity metrics

Exit condition: SLI definitions documented with data sources (metrics, logs, probes).

Stage 2: SLO Targets & Windows

Goal: Set achievable targets with explicit consequences.

Process

- Choose a window: a rolling 30 days is common; align it with release cadence
- Set the target (e.g., 99.9% availability) from error budget math: allowed downtime per month
- Tier services: not everything needs 99.99%
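The error budget math above is simple enough to sketch directly. This example converts an availability target into allowed downtime for a window. The function name and the 30-day default are illustrative choices.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime that an availability target permits in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

Over 30 days, 99.9% leaves roughly 43 minutes and 99.99% leaves roughly 4 minutes, which is why tiering matters: each extra nine shrinks the budget by a factor of ten.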

Realism

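- Account for dependencies you don’t control (public cloud, third-party APIs): a service's SLO cannot realistically exceed the SLO of a hard dependency unless the architecture isolates that dependency's failures (for example, through redundancy, caching, or graceful degradation).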
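To make the dependency ceiling concrete, here is a minimal sketch that estimates the best-case availability of a service sitting behind hard dependencies. It assumes the dependencies fail independently and that any one failing takes the service down. Both are simplifications, and the example figures are invented.

```python
import math
from typing import Sequence


def serial_availability(availabilities: Sequence[float]) -> float:
    """Best-case availability when every component in the list must be up.

    Assumes independent failures and no isolation (no fallback, cache, or retry).
    """
    return math.prod(availabilities)


def target_is_plausible(slo_target: float, availabilities: Sequence[float]) -> bool:
    """True if the target does not exceed what the serial chain can deliver."""
    return slo_target <= serial_availability(availabilities)


# Hypothetical figures: the service itself, a cloud provider, and a third-party API.
chain = [0.9995, 0.9995, 0.999]
print(f"ceiling: {serial_availability(chain):.4%}")
print("99.9% plausible:", target_is_plausible(0.999, chain))
print("99.5% plausible:", target_is_plausible(0.995, chain))
```

With these numbers the ceiling lands near 99.8%, so a 99.9% target would need either better dependencies or an architecture that tolerates their failures.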

Exit condition: Published SLO document per service or journey with measurement method.

Stage 3: Error Budget Policy

Goal: Decide how to spend budget—feature velocity vs reliability work.

Policy Examples

- Budget healthy → ship aggressively; budget low → freeze risky changes, focus on reliability
- Exceptions process: who can override, with what review

Communication

- Product and engineering share ownership of the budget, so decisions are not “SRE says no” in the dark

Exit condition: Written policy: what happens when budget burns at 25/50/100%.
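One way to picture a written policy with 25/50/100% thresholds is as a small lookup from budget consumed to an agreed action. The sketch below is only illustrative. The action wording at each tier is a hypothetical placeholder that product and engineering would replace with their own agreement.

```python
def budget_consumed(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget used so far in the window (1.0 = fully spent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        return 0.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return actual_bad / allowed_bad


# Hypothetical policy tiers, highest threshold first. The wording is an example only.
POLICY = [
    (1.00, "freeze risky changes; only reliability work and approved exceptions ship"),
    (0.50, "require extra review for risky changes; prioritize reliability backlog"),
    (0.25, "notify owners; review recent incidents for budget impact"),
]
DEFAULT_ACTION = "ship normally"


def policy_action(consumed: float) -> str:
    """Return the action for the highest threshold that has been reached."""
    for threshold, action in POLICY:
        if consumed >= threshold:
            return action
    return DEFAULT_ACTION


consumed = budget_consumed(0.999, good_events=999_400, total_events=1_000_000)
print(f"{consumed:.0%} of budget used -> {policy_action(consumed)}")
```

Keeping the tiers in data rather than in code makes the policy easy to review alongside the written document.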

Stage 4: Alerting & On-Call

Goal: Pages are symptom-based, actionable, low noise.

Principles

- Alert on user pain or imminent SLO threat, not every blip
- Severity maps to response: SEV1 customer-wide vs warning
- Runbooks linked; ownership clear
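"Imminent SLO threat" is often expressed as a burn rate: how fast the error budget is being spent relative to the pace that would exactly exhaust it over the window. The sketch below borrows the multiwindow burn-rate idea from the Google SRE Workbook rather than from this document. The 14.4 threshold corresponds to spending 2% of a 30-day budget in one hour, and the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows (1.0 = on pace)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed: 2% of a 30-day budget burned in 1 hour
) -> bool:
    """Page only if both windows burn fast.

    The long window (e.g. 1 hour) shows the problem is significant; the short
    window (e.g. 5 minutes) shows it is still happening, which cuts stale pages.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Sustained 2% errors against a 99.9% SLO: burn rate about 20, so page.
print(should_page(0.02, 0.02, 0.999))
# Errors have already recovered in the short window: do not page.
print(should_page(0.02, 0.0005, 0.999))
```

Alerting on burn rate instead of raw error counts ties each page back to the SLO, which keeps the alert symptom-based and actionable.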

On-Call Health

- Limit pages per engineer per week; track toil hours
- Post-incident follow-through to reduce repeat pages
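A per-engineer weekly page limit can be checked with a simple count over the paging log. The sketch below assumes each page is recorded as an (engineer, date) pair. The limit of 5 is a placeholder, not a recommendation from this document.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

MAX_PAGES_PER_WEEK = 5  # hypothetical limit; set per team

WeekKey = Tuple[str, int, int]  # (engineer, ISO year, ISO week)


def overloaded_weeks(
    pages: Iterable[Tuple[str, date]], limit: int = MAX_PAGES_PER_WEEK
) -> Dict[WeekKey, int]:
    """Return the engineer-weeks whose page count exceeds the limit."""
    counts: Counter = Counter()
    for engineer, day in pages:
        iso_year, iso_week, _weekday = day.isocalendar()
        counts[(engineer, iso_year, iso_week)] += 1
    return {key: n for key, n in counts.items() if n > limit}


# Invented paging log for illustration.
log = [
    ("ana", date(2024, 1, 1)),
    ("ana", date(2024, 1, 2)),
    ("ana", date(2024, 1, 3)),
    ("bo", date(2024, 1, 2)),
]
print(overloaded_weeks(log, limit=2))
```

Grouping by ISO week avoids miscounting weeks that straddle a month or year boundary.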

Exit condition: Alert inventory reviewed; tuning backlog for noisy alerts.

Stage 5: Toil & Automation

Goal: Reduce manual, repetitive, automatable work with measurable toil budgets.

Identify Toil

- Frequent tickets, manual scaling, click-ops deploys, data fixes without guardrails

Remediate

- Eliminate > automate > document, preferred in that order when safe
- Self-service platforms with guardrails beat hero scripts

Exit condition: Toil reduction roadmap with owners; ideally each team aims to keep toil under 50% of its time (a Google SRE guideline; adapt to your org).
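A toil budget only works if toil is measured. As a rough sketch, the example below compares a team's tracked toil hours against the 50% cap and ranks toil sources so the roadmap can start with the largest. The category names and hours are invented for illustration.

```python
from typing import Dict, List, Tuple

TOIL_CAP = 0.50  # the 50% guideline mentioned above; adapt to the org


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def rank_toil_sources(hours_by_source: Dict[str, float]) -> List[Tuple[str, float]]:
    """Largest toil sources first, as candidates for elimination or automation."""
    return sorted(hours_by_source.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weekly toil log for one engineer (40-hour week).
weekly_toil = {"manual deploys": 9.0, "scaling tickets": 8.0, "data fixes": 5.0}
fraction = toil_fraction(sum(weekly_toil.values()), total_hours=40.0)

print(f"toil share: {fraction:.0%} (cap {TOIL_CAP:.0%})")
print("over cap:", fraction > TOIL_CAP)
for source, hours in rank_toil_sources(weekly_toil):
    print(f"  {source}: {hours:.1f} h/week")
```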

Stage 6: Continuous Improvement

Goal: Reliability work is prioritized like features.

Loops

- Incident → action items with tracking
- Game days / failure injection where mature
- Quarterly SLO review, because targets drift with product changes

Final Review Checklist

- [ ] SLIs tied to user-visible outcomes
- [ ] SLO targets realistic vs dependencies
- [ ] Error budget policy agreed with product
- [ ] Alerts actionable; noise tracked
- [ ] Toil identified with automation path

Tips for Effective Guidance

- Translate 99.9% to minutes of downtime per month; it makes trade-offs concrete.
- Never promise zero incidents; promise learning and measurable improvement.
- Separate SLI (measurement) from SLO (target) from SLA (contract); the terms are often confused.

Handling Deviations

- Early startup: start with basic monitoring and incident reviews before a full SLO program.
- No SRE role: the practices still apply; relabel them as “production excellence” if needed.

