Load Testing (Deep Workflow)
Load tests answer whether the system behaves as required under target load, not “how many RPS the tool prints.” Tie every run to SLOs, a realistic workload, and analysis that engineers can act on.
When to Offer This Workflow
Trigger conditions:
- Major launch, traffic spike season, infra resize
- Latency/timeout under peak; need evidence for capacity decisions
- Comparing architectures or debottlenecking
Initial offer:
Use seven stages: (1) goals & SLOs, (2) workload model, (3) scenarios & scripts, (4) environment & data, (5) run & observe, (6) analyze bottlenecks, (7) fixes & retest. Confirm tool (k6, Locust, Gatling, JMeter) and environment policy (prod-like staging vs synthetic).
Stage 1: Goals & SLOs
Goal: Define success in measurable terms.
Questions
- Peak RPS/users, growth assumption, duration of peak
- SLOs: p95/p99 latency, error rate, throughput per critical endpoint
- Scope: read-heavy vs write-heavy; background jobs interaction
Exit condition: Numeric targets plus an explicit out-of-scope list (e.g., “third-party API mocked”).
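One way to make the Stage 1 targets checkable is to record them as data and compare a run against them. The sketch below is illustrative only: the `EndpointSlo` fields, the nearest-rank percentile method, and all threshold values are assumptions, not something this workflow prescribes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointSlo:
    """Numeric targets for one critical endpoint (values are illustrative)."""
    name: str
    p95_ms: float
    p99_ms: float
    max_error_rate: float  # fraction of requests, e.g. 0.01 means 1%
    min_rps: float


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_slo(slo, latencies_ms, error_count, duration_s):
    """Return a list of violations; an empty list means the run met the targets."""
    total = len(latencies_ms)
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    error_rate = error_count / total
    rps = total / duration_s

    violations = []
    if p95 > slo.p95_ms:
        violations.append(f"{slo.name}: p95 {p95} ms exceeds {slo.p95_ms} ms")
    if p99 > slo.p99_ms:
        violations.append(f"{slo.name}: p99 {p99} ms exceeds {slo.p99_ms} ms")
    if error_rate > slo.max_error_rate:
        violations.append(f"{slo.name}: error rate {error_rate:.4f} exceeds {slo.max_error_rate}")
    if rps < slo.min_rps:
        violations.append(f"{slo.name}: throughput {rps:.1f} RPS below {slo.min_rps}")
    return violations
```

Keeping latency, error rate, and throughput in one record makes it harder to report a throughput figure without its latency target.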
Stage 2: Workload Model
Goal: A representative request mix, not one URL hit forever.
Practices
- Transaction mix from analytics or access logs (proportions)
- Think time between steps for user journeys
- Payload size distribution; auth token behavior
- Spike vs soak vs step ramp—match real failure modes
Exit condition: Workload profile documented (table or script comments).
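A weighted transaction mix with think time can be sketched in a few lines. Everything named here is hypothetical: the transaction names, the proportions, and the choice of an exponential think-time distribution are placeholders for whatever your analytics or access logs actually show.

```python
import random

# Hypothetical mix; real proportions should come from analytics or access logs.
TRANSACTION_MIX = {
    "browse_catalog": 0.60,
    "search": 0.25,
    "add_to_cart": 0.10,
    "checkout": 0.05,
}


def next_transaction(rng):
    """Pick the next transaction according to the documented proportions."""
    names = list(TRANSACTION_MIX)
    weights = list(TRANSACTION_MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]


def think_time_s(rng, mean_s=3.0, floor_s=0.5):
    """Pause between steps: exponential around mean_s, never shorter than floor_s."""
    return max(floor_s, rng.expovariate(1.0 / mean_s))


def plan_session(rng, steps):
    """Return (transaction, think_time_seconds) pairs for one virtual user."""
    return [(next_transaction(rng), think_time_s(rng)) for _ in range(steps)]


# Passing a seeded generator keeps the plan reproducible between runs.
session = plan_session(random.Random(2024), steps=10)
```

The dictionary doubles as the documented workload profile, so the script and the written profile cannot drift apart.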
Stage 3: Scenarios & Scripts
Goal: Deterministic, idempotent load scripts where possible.
Practices
- Correlate virtual user with trace/request id for debugging
- Parameterize data so cache hits do not flatter the results (e.g., every request hitting the same key)
- Order operations to match real causality (login → browse → checkout)
Pitfalls
- Client-side bottleneck (a single generator machine saturates first): distribute load generators
Exit condition: A smoke run at small scale (a handful of virtual users) validates script correctness.
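The parameterization and correlation practices above can be sketched without committing to a particular tool. The `X-Request-Id` header name, the `/products/{id}` path, and the `run_id` format are assumptions for illustration; use whatever your services and tracing setup actually expect.

```python
def build_request(run_id, vu_id, iteration, product_ids):
    """Describe one request for a virtual user, deterministically.

    Rotating through product_ids spreads requests across keys instead of
    hammering one cached entry. The request id encodes run, virtual user,
    and iteration so a slow request can be matched to server-side traces.
    """
    if not product_ids:
        raise ValueError("product_ids must not be empty")
    product_id = product_ids[(vu_id + iteration) % len(product_ids)]
    request_id = f"{run_id}-vu{vu_id}-it{iteration}"
    return {
        "method": "GET",
        "path": f"/products/{product_id}",           # hypothetical endpoint
        "headers": {"X-Request-Id": request_id},     # assumed header name
    }


# A small smoke plan: 2 virtual users, 3 iterations each.
smoke_plan = [
    build_request("lt-smoke", vu, it, [101, 102, 103, 104])
    for vu in range(2)
    for it in range(3)
]
```

Because the same inputs always give the same request, a failed smoke run can be replayed exactly.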
Stage 4: Environment & Data
Goal: Fidelity without destroying prod.
Rules
- Staging scaled proportionally to prod; feature flags aligned
- Data volume within the same order of magnitude as prod, so DB query plans match
- External deps: mock, sandbox, or at least know their rate limits
Exit condition: Safety checklist: no prod writes unless explicitly planned and isolated.
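Part of the safety checklist can be enforced in the script itself, as a guard that runs before any load is generated. This is a minimal sketch: the host names in `PROD_HOSTS` and the `prod_write_plan` parameter are hypothetical, and a real setup would likely pair this with network-level controls.

```python
from urllib.parse import urlparse

# Hypothetical host list: replace with whatever your organisation treats as production.
PROD_HOSTS = frozenset({"api.example.com", "www.example.com"})


def assert_safe_target(base_url, writes_enabled, prod_write_plan=None):
    """Return the target host, or raise if the run would write to prod without a plan.

    prod_write_plan is a hypothetical free-text reference (ticket id, doc link)
    showing that prod writes were explicitly planned and isolated.
    """
    host = urlparse(base_url).hostname
    if host is None:
        raise ValueError(f"cannot determine host from {base_url!r}")
    if host in PROD_HOSTS and writes_enabled and not prod_write_plan:
        raise PermissionError(
            f"refusing to run write scenarios against {host} without a documented plan"
        )
    return host
```

Calling the guard first thing in the script means a mistyped base URL fails fast instead of producing prod writes.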
Stage 5: Run & Observe
Goal: System-wide visibility during test.
Instrumentation
- App: latency histograms, error codes, queue depth
- Infra: CPU, memory, connections, GC, disk IOPS
- DB: slow queries, locks, replication lag
- Tracing: sample traces during the test to find hot spans
Exit condition: Dashboard or runbook link for the test window.
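If the app does not already export latency histograms, the idea is simple enough to sketch. The bucket boundaries below are arbitrary assumptions; in practice a metrics library would normally provide this, and the boundaries should bracket your SLO thresholds.

```python
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram; each bucket counts samples at or below its upper bound."""

    def __init__(self, upper_bounds_ms):
        self.bounds = sorted(upper_bounds_ms)
        # One extra bucket at the end collects samples above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, latency_ms):
        # bisect_left finds the first bound that is >= latency_ms.
        self.counts[bisect.bisect_left(self.bounds, latency_ms)] += 1

    def share_at_or_below(self, bound_ms):
        """Fraction of samples in buckets whose upper bound is <= bound_ms."""
        total = sum(self.counts)
        if total == 0:
            return 0.0
        buckets = bisect.bisect_right(self.bounds, bound_ms)
        return sum(self.counts[:buckets]) / total


hist = LatencyHistogram([50, 100, 250, 500, 1000])  # illustrative bounds in ms
for sample_ms in (12, 48, 75, 130, 620):
    hist.observe(sample_ms)
```

Unlike an average, the bucket counts keep the shape of the distribution, which is what the tail analysis in the next stage needs.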
Stage 6: Analyze Bottlenecks
Goal: Identify dominant constraint: app, DB, network, dependency.
Process
- Utilization vs saturation (e.g., CPU looks high but threads are waiting on locks, which calls for a different fix)
- Compare p95 vs max; the tail is often a separate issue
- Reproduce bottleneck with smaller experiment when unclear
Exit condition: Written hypothesis with evidence (graphs, trace ids).
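The p95-versus-max comparison is easy to compute from raw latencies. The sketch below uses a nearest-rank percentile; the `max_over_p95` ratio is an illustrative convenience, not a standard metric, and what counts as a "large" ratio is a judgment call for your system.

```python
import math


def nearest_rank(ordered, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_summary(latencies_ms):
    """Summarise the body and the tail of a latency distribution."""
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("no samples")
    p95 = nearest_rank(ordered, 95)
    worst = ordered[-1]
    return {
        "p50_ms": nearest_rank(ordered, 50),
        "p95_ms": p95,
        "p99_ms": nearest_rank(ordered, 99),
        "max_ms": worst,
        # A large ratio suggests the worst requests have a different cause
        # than the bulk of slow requests (e.g. a timeout or a lock wait).
        "max_over_p95": worst / p95 if p95 > 0 else math.inf,
    }
```

For example, 99 requests at 10 ms and one at 1000 ms give a p95 of 10 ms and a ratio of 100, a pattern worth tracing separately from general slowness.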
Stage 7: Fixes & Retest
Goal: Controlled changes with retest protocol.
Practices
- One major change per retest when debugging
- Document baseline vs after results, both to catch regressions and to feed capacity planning
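A baseline-versus-after comparison can be kept as a small script next to the results. The metric names, the 5% tolerance, and the rule that only `rps` is "higher is better" are assumptions for illustration.

```python
import math

# Metrics where a larger value is an improvement; everything else is "lower is better".
HIGHER_IS_BETTER = frozenset({"rps"})


def compare_runs(baseline, after, tolerance=0.05):
    """Return per-metric relative change and whether it counts as a regression.

    baseline and after map metric name -> value. A metric regresses when it
    moves in the wrong direction by more than `tolerance` (a fraction).
    """
    report = {}
    for name, base in baseline.items():
        new = after[name]
        if base == 0:
            change = 0.0 if new == 0 else math.inf
        else:
            change = (new - base) / base
        worse_by = -change if name in HIGHER_IS_BETTER else change
        report[name] = {"change": change, "regressed": worse_by > tolerance}
    return report


# Illustrative numbers only.
report = compare_runs(
    baseline={"p95_ms": 200, "p99_ms": 600, "error_rate": 0.004, "rps": 1000},
    after={"p95_ms": 150, "p99_ms": 610, "error_rate": 0.004, "rps": 900},
)
```

Here the fix improves p95 but costs throughput, which is exactly the kind of trade-off a single headline number would hide.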
Final Review Checklist
- [ ] SLO-aligned goals and workload mix
- [ ] Realistic scenarios; distributed load if needed
- [ ] Environment safe and representative enough
- [ ] Full-stack observability during runs
- [ ] Bottleneck analysis leads to actionable tickets
Tips for Effective Guidance
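- Match cache state to prod: warm caches explicitly if prod is always warm, and do not let a warm test cache flatter a system that runs cold in prod. Either mismatch gives misleading numbers.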
- Throughput without latency SLO is meaningless.
- Call out coordination overhead (locks, hot keys) vs raw CPU.
Handling Deviations
- Cannot match prod data: state the assumptions and treat results as directional only.
- Serverless: account for cold start and account concurrency limits in interpretation.
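For the serverless case, one way to account for cold starts is to report them separately from warm invocations. This sketch assumes each sample already carries a cold-start flag; how you obtain that flag depends on the platform and is not covered here.

```python
from statistics import median


def split_by_cold_start(samples):
    """Summarise cold and warm invocations separately.

    samples: iterable of (latency_ms, was_cold_start) pairs. The flag is an
    assumption: it must come from platform logs or your own instrumentation.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("no samples")
    cold = [ms for ms, is_cold in samples if is_cold]
    warm = [ms for ms, is_cold in samples if not is_cold]
    return {
        "cold_share": len(cold) / len(samples),
        "cold_median_ms": median(cold) if cold else None,
        "warm_median_ms": median(warm) if warm else None,
    }
```

A high cold share during a ramp may say more about the ramp rate and concurrency limits than about the code under test.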