SQL Optimization (Deep Workflow)
Optimization without measurement is guesswork. Structure the work as observe → explain (plan) → change → verify, with explicit attention to correctness, locks, and write amplification from indexes.
When to Offer This Workflow
Trigger conditions:
- - Slow queries, growing P95/P99, replication lag, lock waits
- ORM-generated SQL surprises; N+1 at DB layer
- Index explosion, bloat, or “we added indexes everywhere”
Initial offer:
Use six stages: (1) frame the problem, (2) reproduce & measure, (3) read execution plans, (4) schema & indexes, (5) query & transaction tuning, (6) verify & guardrail. Confirm engine (PostgreSQL, MySQL, SQL Server, etc.) and environment (prod-like data volume).
Stage 1: Frame the Problem
Goal: Define SLO, scope, and non-goals.
Questions
- 1. Which queries or endpoints are slow? User-facing vs batch?
- Regression—did deploy, data volume, or stats change?
- Isolation level and consistency requirements—can we read replicas?
- Write risk: is this table write-heavy? Index cost?
Exit condition: One-line problem statement with metric (e.g., “p95 2.4s on /reports at 10k RPS”).
Stage 2: Reproduce & Measure
Goal: Stable repro with representative cardinality and parameters.
Actions
- - Capture exact SQL, parameters, and frequency
- Use EXPLAIN (ANALYZE, BUFFERS) or equivalent—engine-specific
- Check buffer cache effects: cold vs warm cache; run twice when needed
- Compare prod stats vs staging—row counts, histograms
Pitfalls
- - Optimizing on empty dev DB
- Different parameter sniffing values changing plan choice
Exit condition: Baseline numbers + plan hash or saved plan for A/B.
Stage 3: Read Execution Plans
Goal: Name the dominant cost: seq scan, bad join order, sort, hash spill, nested loop explosion.
Interpret (adapt to engine)
- - Seq scan on large tables—filter selectivity? missing index? stats?
- Index scan vs bitmap vs index only—covering indexes trade-offs
- Joins: wrong order, missing stats, outdated NDV
- Sort/hash spills to disk—work_mem / memory grants
- Locks:
FOR UPDATE, long transactions, hot row updates
Exit condition: Hypothesis tied to plan node(s), not generic “add index.”
Stage 4: Schema & Indexes
Goal: Right indexes for read paths without destroying writes.
Strategy
- - Composite index column order: equality → range; avoid redundant indexes
- Partial indexes for hot subsets
- Covering indexes vs table bloat—measure write cost
- Foreign keys and constraints affecting plans
- Statistics:
ANALYZE, extended stats, histograms—when stale stats lie
Advanced (when relevant)
- - Partitioning for prune + maintenance
- Materialized views / pre-aggregation for heavy reports
Exit condition: DDL proposal with rationale and rollback (drop index concurrently if supported).
Stage 5: Query & Transaction Tuning
Goal: Sometimes the fix is SQL rewrite, not hardware.
Techniques
- - Reduce rows touched early (CTEs vs inline—engine-dependent)
- Pagination without OFFSET on huge pages (keyset)
- Batch vs row-by-row; UNION ALL vs OR
- N+1: batch queries, joins, data loader patterns
- Transactions: shorten locks; avoid unnecessary INLINECODE3
- ORM: eager vs lazy loading discipline
Exit condition: New plan shows lower cost / measured latency; lock time acceptable.
Stage 6: Verify & Guardrail
Goal: Improvement holds under load and doesn’t regress neighbors.
Verify
- - Re-run EXPLAIN ANALYZE with production-like parameters
- Load test or shadow traffic if available
- Monitor: buffer hit ratio, index bloat, replication lag
Guardrails
- - Query timeouts and statement_timeout where safe
- Alerts on sequential scans on large tables if observability supports
Final Review Checklist
- - [ ] Baseline and target metrics documented
- [ ] Plan-based root cause, not guesswork
- [ ] Index/DDL changes justified vs write load
- [ ] Transaction/lock behavior considered
- [ ] Verification on realistic data and load
Tips for Effective Guidance
- - Always mention parameter sniffing and stale statistics as frequent culprits.
- Warn when adding indexes on very write-heavy tables without measuring bloat.
- Prefer keyset pagination education for large lists.
Handling Deviations
- - No EXPLAIN access: infer from symptoms + ORM logs + index list; recommend safe staging repro.
- Vendor DB: name that hints and features differ—avoid PostgreSQL-only advice on SQL Server without caveat.
SQL优化(深度工作流)
没有测量的优化只是猜测。将工作结构化为观察 → 解释(计划)→ 变更 → 验证,并明确关注正确性、锁以及索引带来的写放大。
何时提供此工作流
触发条件:
- - 慢查询、P95/P99持续增长、复制延迟、锁等待
- ORM生成的SQL意外情况;数据库层的N+1问题
- 索引爆炸、膨胀,或我们在所有地方都加了索引
初始提供:
使用六个阶段:(1)界定问题,(2)复现与测量,(3)读取执行计划,(4)模式与索引,(5)查询与事务调优,(6)验证与防护。确认数据库引擎(PostgreSQL、MySQL、SQL Server等)和环境(接近生产的数据量)。
阶段1:界定问题
目标: 定义SLO、范围和非目标。
问题
- 1. 哪些查询或端点较慢?面向用户还是批处理?
- 回归——部署、数据量或统计信息是否发生变化?
- 隔离级别和一致性要求——能否读取副本?
- 写入风险:该表写入密集吗?索引成本?
退出条件: 包含指标的一行问题陈述(例如:/reports在10k RPS下P95为2.4秒)。
阶段2:复现与测量
目标: 使用具有代表性的基数和参数进行稳定复现。
操作
- - 捕获精确SQL、参数和频率
- 使用EXPLAIN(ANALYZE, BUFFERS)或等效命令——引擎相关
- 检查缓冲缓存影响:冷缓存与热缓存;必要时运行两次
- 比较生产统计信息与测试环境——行数、直方图
陷阱
- - 在空的开发数据库上优化
- 不同的参数嗅探值改变计划选择
退出条件: 基准数值 + 计划哈希或保存的计划用于A/B对比。
阶段3:读取执行计划
目标: 指出主导成本:顺序扫描、糟糕的连接顺序、排序、哈希溢出、嵌套循环爆炸。
解读(适配引擎)
- - 大表上的顺序扫描——过滤选择性?缺少索引?统计信息?
- 索引扫描 vs 位图扫描 vs 仅索引扫描——覆盖索引的权衡
- 连接:错误顺序、缺少统计信息、过时的NDV
- 排序/哈希溢出到磁盘——work_mem/内存分配
- 锁:FOR UPDATE、长事务、热行更新
退出条件: 与计划节点相关的假设,而非泛泛的加索引。
阶段4:模式与索引
目标: 为读取路径建立正确的索引,同时不破坏写入性能。
策略
- - 复合索引列顺序:等值 → 范围;避免冗余索引
- 针对热点子集的部分索引
- 覆盖索引与表膨胀——测量写入成本
- 影响计划的外键和约束
- 统计信息:ANALYZE、扩展统计信息、直方图——当过时统计信息误导时
高级(相关时)
- - 分区用于裁剪+维护
- 物化视图/预聚合用于重型报表
退出条件: 包含理由和回滚方案的DDL建议(如果支持,使用并发方式删除索引)。
阶段5:查询与事务调优
目标: 有时修复方法是SQL重写,而非硬件升级。
技术
- - 尽早减少触及行数(CTE与内联——引擎相关)
- 大页面上的分页不使用OFFSET(键集分页)
- 批量 vs 逐行处理;UNION ALL vs OR
- N+1:批量查询、连接、数据加载器模式
- 事务:缩短锁时间;避免不必要的SELECT FOR UPDATE
- ORM:急加载与懒加载的规范
退出条件: 新计划显示更低成本/测量延迟;锁时间可接受。
阶段6:验证与防护
目标: 改进在负载下保持有效,且不使相邻查询退化。
验证
- - 使用接近生产的参数重新运行EXPLAIN ANALYZE
- 如果可用,进行负载测试或影子流量测试
- 监控:缓冲区命中率、索引膨胀、复制延迟
防护
- - 在安全的地方设置查询超时和statement_timeout
- 如果可观测性支持,对大表上的顺序扫描设置告警
最终审查清单
- - [ ] 记录基准和目标指标
- [ ] 基于计划的根本原因,而非猜测
- [ ] 索引/DDL变更相对于写入负载的合理性
- [ ] 考虑事务/锁行为
- [ ] 在真实数据和负载下进行验证
有效指导技巧
- - 始终提及参数嗅探和过时统计信息是常见元凶。
- 在未测量膨胀的情况下,警告在写入密集的表上添加索引。
- 对于大列表,优先推荐键集分页教育。
处理偏差
- - 无法访问EXPLAIN:从症状+ORM日志+索引列表推断;推荐在安全测试环境复现。
- 商业数据库:指出提示和功能差异——避免在SQL Server上给出仅适用于PostgreSQL的建议而不加说明。