Error Handling
Consistent error handling reduces support load and on-call pain. Design a taxonomy, stable codes, safe user messaging, and operator visibility, without leaking secrets or stack traces to clients.
When to Offer This Workflow
Trigger conditions:
- Inconsistent HTTP status codes and response bodies
- Retry storms or duplicate side effects from naive retries
- Logs that cannot be tied to user-visible failures
Initial offer:
Use six stages: (1) classify errors, (2) map to transport, (3) user messaging, (4) retries & idempotency, (5) observability, (6) client SDKs & DX. Confirm whether the API is REST, GraphQL, or gRPC, and whether calls are synchronous or asynchronous.
Stage 1: Classify Errors
Goal: Distinguish validation, authentication, authorization, not found, conflict, rate limit, dependency failure, and internal bugs.
Exit condition: Table or enum of codes with owning team and meaning.
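A minimal sketch of what the exit artifact could look like as an enum plus a small catalog. The specific codes, team names, and the `ErrorCode` and `CATALOG` names are hypothetical and only illustrate the shape.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorClass(Enum):
    """The eight classes named in the Stage 1 goal."""
    VALIDATION = "validation"
    AUTHENTICATION = "authentication"
    AUTHORIZATION = "authorization"
    NOT_FOUND = "not_found"
    CONFLICT = "conflict"
    RATE_LIMIT = "rate_limit"
    DEPENDENCY_FAILURE = "dependency_failure"
    INTERNAL = "internal"


@dataclass(frozen=True)
class ErrorCode:
    code: str                # stable identifier; do not rename once published
    error_class: ErrorClass
    owner: str               # owning team (hypothetical names below)
    meaning: str


# Hypothetical entries; a real catalog would be filled in per service.
CATALOG = {
    entry.code: entry
    for entry in [
        ErrorCode("ORDER_INVALID_QUANTITY", ErrorClass.VALIDATION,
                  "orders-team", "Quantity must be a positive integer."),
        ErrorCode("ORDER_ALREADY_SHIPPED", ErrorClass.CONFLICT,
                  "orders-team", "Order can no longer be modified."),
        ErrorCode("PAYMENT_PROVIDER_UNAVAILABLE", ErrorClass.DEPENDENCY_FAILURE,
                  "payments-team", "Upstream payment provider did not respond."),
    ]
}
```

Keeping the owner next to each code makes it easier to route an alert or a support ticket to the right team.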
Stage 2: Map to Transport
Goal: Correct HTTP 4xx/5xx status codes; GraphQL errors with extensions; gRPC status codes; optional RFC 9457 Problem Details (which obsoletes RFC 7807) for JSON APIs.
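One possible mapping from error class to HTTP status and gRPC status name, with a helper that builds a Problem Details body. The choice of 400 for validation and 503 for dependency failure is an assumption (some teams prefer 422 or 502), and the `errors.example.com` type URI and the `code` extension member are placeholders.

```python
from http import HTTPStatus

# error class -> (HTTP status, gRPC status name). One reasonable mapping,
# not the only one; adjust to the conventions your API already follows.
TRANSPORT_MAP = {
    "validation": (HTTPStatus.BAD_REQUEST, "INVALID_ARGUMENT"),
    "authentication": (HTTPStatus.UNAUTHORIZED, "UNAUTHENTICATED"),
    "authorization": (HTTPStatus.FORBIDDEN, "PERMISSION_DENIED"),
    "not_found": (HTTPStatus.NOT_FOUND, "NOT_FOUND"),
    "conflict": (HTTPStatus.CONFLICT, "ABORTED"),
    "rate_limit": (HTTPStatus.TOO_MANY_REQUESTS, "RESOURCE_EXHAUSTED"),
    "dependency_failure": (HTTPStatus.SERVICE_UNAVAILABLE, "UNAVAILABLE"),
    "internal": (HTTPStatus.INTERNAL_SERVER_ERROR, "INTERNAL"),
}


def problem_details(error_class, code, detail, instance):
    """Build a Problem Details dict; unknown classes fall back to 500."""
    status, _grpc_status = TRANSPORT_MAP.get(error_class, TRANSPORT_MAP["internal"])
    return {
        "type": f"https://errors.example.com/{code}",  # placeholder URI
        "title": status.phrase,
        "status": int(status),
        "detail": detail,      # must already be safe to show to clients
        "instance": instance,
        "code": code,          # extension member carrying the stable code
    }
```

Such a body would typically be served with the `application/problem+json` media type.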
Stage 3: User Messaging
Goal: Actionable copy for end users; opaque support reference id; no internal hostnames, SQL fragments, or stack traces in client responses.
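A rough sketch of how the client-facing body can be kept separate from the internal record, linked only by an opaque support reference. The message catalog, the field names, and `to_client_error` are illustrative assumptions.

```python
import uuid

# Hypothetical user-facing copy keyed by stable error code.
SAFE_MESSAGES = {
    "ORDER_INVALID_QUANTITY": "Enter a quantity of 1 or more.",
    "RATE_LIMITED": "Too many requests. Wait a moment and try again.",
}
GENERIC_MESSAGE = (
    "Something went wrong on our side. Try again, and contact support "
    "with the reference below if the problem continues."
)


def to_client_error(code, internal_exc):
    """Return (client_body, internal_record) for one failure.

    Only the internal record carries exception details; the client body
    holds the stable code, safe copy, and an opaque reference id.
    """
    support_ref = uuid.uuid4().hex[:12]
    internal_record = {
        "support_ref": support_ref,
        "code": code,
        "exception": repr(internal_exc),  # logged server-side, never returned
    }
    client_body = {
        "code": code,
        "message": SAFE_MESSAGES.get(code, GENERIC_MESSAGE),
        "support_ref": support_ref,
    }
    return client_body, internal_record
```

Support staff can then look up the internal record by `support_ref` without the client ever seeing hostnames or SQL.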
Stage 4: Retries & Idempotency
Goal: Retry only safe or idempotent operations; exponential backoff with jitter; align with idempotency keys on writes.
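The retry policy above can be sketched as follows, using the "full jitter" variant of exponential backoff. `TransientError`, `call_with_retry`, and the default base, cap, and attempt count are assumptions for illustration; the set of retryable methods follows the usual HTTP definition of idempotent methods.

```python
import random
import time

# HTTP methods that are idempotent by definition.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retry(operation, method, idempotency_key=None,
                    max_attempts=4, sleep=time.sleep):
    """Retry only idempotent methods, or writes that carry an idempotency key."""
    retryable = method.upper() in IDEMPOTENT_METHODS or idempotency_key is not None
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            attempt += 1
            if not retryable or attempt >= max_attempts:
                raise  # unsafe to retry, or attempts exhausted
            sleep(backoff_delay(attempt - 1))
```

A POST without an idempotency key fails on the first transient error rather than risking a duplicate side effect.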
Stage 5: Observability
Goal: Structured logs with error.code, trace_id, user_id (where allowed); metrics by error class; alerts on error-rate SLO burn.
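A small sketch of structured error logging with the standard `logging` module. The `JsonFormatter` class, the `log_error` helper, and the in-process `Counter` (standing in for a real metrics client) are assumptions, not a prescribed setup.

```python
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the fields named above."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "error.code": getattr(record, "error_code", None),
            "error.class": getattr(record, "error_class", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        user_id = getattr(record, "user_id", None)
        if user_id is not None:  # include only where policy allows it
            payload["user_id"] = user_id
        return json.dumps(payload)


# Stand-in for a metrics client: counts errors by class.
ERRORS_BY_CLASS = Counter()


def log_error(logger, message, *, code, error_class, trace_id, user_id=None):
    """Count the error by class and log it with structured fields."""
    ERRORS_BY_CLASS[error_class] += 1
    extra = {"error_code": code, "error_class": error_class, "trace_id": trace_id}
    if user_id is not None:
        extra["user_id"] = user_id
    logger.error(message, extra=extra)
```

Counting by class rather than by individual code keeps metric cardinality low while the log line still carries the exact code.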
Stage 6: Client SDKs & DX
Goal: Typed errors in SDKs; documented recovery; map codes to user-facing strings in apps consistently.
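As an illustration, an SDK might expose a small exception hierarchy and build the right type from the response body. The class names, the `class` and `support_ref` response fields, and the user-facing strings are hypothetical.

```python
class ApiError(Exception):
    """Base class for all errors raised by the (hypothetical) SDK."""
    retryable = False

    def __init__(self, code, message, support_ref=None):
        super().__init__(message)
        self.code = code
        self.support_ref = support_ref


class ValidationError(ApiError):
    pass


class RateLimitError(ApiError):
    retryable = True  # documented recovery: back off, then retry


class DependencyError(ApiError):
    retryable = True


_ERROR_TYPES = {
    "validation": ValidationError,
    "rate_limit": RateLimitError,
    "dependency_failure": DependencyError,
}

# One shared table so every app screen shows the same copy for a code.
USER_STRINGS = {
    "RATE_LIMITED": "Too many requests. Wait a moment and try again.",
}
DEFAULT_USER_STRING = "Something went wrong. Please try again."


def error_from_response(body):
    """Map a parsed error body to a typed exception (unknown -> ApiError)."""
    error_type = _ERROR_TYPES.get(body.get("class"), ApiError)
    return error_type(body["code"], body.get("message", ""), body.get("support_ref"))


def user_string(err):
    return USER_STRINGS.get(err.code, DEFAULT_USER_STRING)
```

Callers can then write `except RateLimitError:` instead of inspecting status codes or parsing message strings.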
Final Review Checklist
- [ ] Taxonomy and ownership defined
- [ ] Transport mapping correct and consistent
- [ ] User-safe messages with correlation ids
- [ ] Retry policy matches idempotency story
- [ ] Logs and metrics wired for ops
Tips for Effective Guidance
- Separate expected validation errors from unexpected 500s in dashboards.
- Pair with idempotency for write paths and queues.
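To illustrate the idempotency tip for write paths and queue consumers, here is a minimal in-memory sketch. `IdempotentHandler` is a hypothetical name, and a real service would need a durable store with an atomic insert instead of a dict.

```python
class IdempotentHandler:
    """Run a write handler at most once per idempotency key.

    In-memory only: this illustrates the idea, not a production design.
    """

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # idempotency key -> stored result

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._results:
            # Retry or redelivered message: replay the stored result
            # without repeating the side effect.
            return self._results[idempotency_key]
        result = self._handler(payload)
        self._results[idempotency_key] = result
        return result
```

With this in place, a client retry or a redelivered queue message with the same key returns the original result instead of charging or writing twice.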
Handling Deviations
- Mobile offline: queue with explicit user-visible sync state.
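One way to picture the offline case is a local queue whose items carry an explicit state the UI can display. The state names, `OfflineQueue`, `PermanentError`, and the `send` callback signature are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SyncState(Enum):
    PENDING = "pending"   # waiting for connectivity
    SYNCED = "synced"     # accepted by the server
    FAILED = "failed"     # rejected; needs user attention


class PermanentError(Exception):
    """Hypothetical: the server rejected the write and retrying will not help."""


@dataclass
class QueuedWrite:
    idempotency_key: str
    payload: dict
    state: SyncState = SyncState.PENDING


class OfflineQueue:
    def __init__(self, send):
        self._send = send  # callable(idempotency_key, payload)
        self.items = []

    def enqueue(self, idempotency_key, payload):
        item = QueuedWrite(idempotency_key, payload)
        self.items.append(item)
        return item

    def flush(self):
        """Try to send pending items in order; stop while still offline."""
        for item in self.items:
            if item.state is not SyncState.PENDING:
                continue
            try:
                self._send(item.idempotency_key, item.payload)
                item.state = SyncState.SYNCED
            except ConnectionError:
                break  # still offline; remaining items stay PENDING
            except PermanentError:
                item.state = SyncState.FAILED
```

Sending the idempotency key with each queued write lets the server deduplicate if a flush is interrupted and repeated.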