ETL Design
ETL is correctness under change: schema drift, partial loads, retries, and reconciliation with upstream systems.
When to Offer This Workflow
Trigger conditions:
- Batch loads into a warehouse or data lake
- Choosing between CDC, snapshots, and incremental watermarks
- Missing rows, duplicates, or inconsistent aggregates downstream
Initial offer:
Use six stages: (1) source contract, (2) extract strategy, (3) transform rules, (4) load & dedupe, (5) validation, (6) operations & backfill. Confirm the batch window and SLA.
Stage 1: Source Contract
Goal: Document schema, primary keys, change indicators (updated_at, CDC log position), and access constraints (rate limits, read replicas).
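One way to keep the contract checkable rather than buried in a wiki is to record it as data. The sketch below is illustrative only: the `SourceContract` class and all of its field names are hypothetical, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceContract:
    """Hypothetical record of what the pipeline assumes about one source table."""
    table: str
    columns: dict[str, str]              # column name -> declared type
    primary_key: tuple[str, ...]
    change_indicator: Optional[str]      # e.g. "updated_at" or a CDC position field
    max_requests_per_minute: Optional[int] = None  # access constraint, if any
    read_from_replica: bool = True

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the contract is usable."""
        issues = []
        if not self.primary_key:
            issues.append("no primary key declared")
        for col in self.primary_key:
            if col not in self.columns:
                issues.append(f"primary key column {col!r} missing from schema")
        if self.change_indicator and self.change_indicator not in self.columns:
            issues.append(f"change indicator {self.change_indicator!r} missing from schema")
        return issues

    def drift(self, observed: dict[str, str]) -> dict[str, set]:
        """Compare the declared schema with columns actually observed at the source."""
        return {
            "added": set(observed) - set(self.columns),
            "removed": set(self.columns) - set(observed),
            "retyped": {c for c in self.columns
                        if c in observed and observed[c] != self.columns[c]},
        }
```

Running `problems()` in CI and `drift()` at the start of each batch is one possible way to surface schema drift before it reaches the load step.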
Stage 2: Extract Strategy
Goal: Choose among full dump, incremental watermark, and CDC, trading off freshness, source load, and complexity.
Practices
- CDC for large sources; snapshots for small or infrequent tables
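For the incremental watermark option, a minimal sketch might look like the following. It assumes a hypothetical `orders` table with an `updated_at` column stored as ISO-8601 text (so string order matches time order) and uses SQLite purely for illustration.

```python
import sqlite3


def extract_incremental(conn: sqlite3.Connection, last_watermark: str):
    """Return rows changed after last_watermark, plus the new watermark.

    Assumes the hypothetical table orders(id, amount, updated_at).
    """
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only as far as the data actually read.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```

A strict `>` comparison can miss rows committed later with the same timestamp as the watermark. Re-reading a small overlap window and relying on an idempotent load is a common mitigation, at the cost of some duplicate extraction. Persist the new watermark only after the load succeeds.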
Stage 3: Transform Rules
Goal: Deterministic transforms; surrogate keys; business rules versioned; handling of deletes (tombstones vs hard deletes).
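As a rough illustration of these rules, the sketch below derives a surrogate key by hashing the natural key, tags output with a rules version, and carries deletes as a tombstone flag. The field names, the `"shop"` source label, and `RULES_VERSION` are assumptions made for the example.

```python
import hashlib
from decimal import Decimal

RULES_VERSION = "v1"  # hypothetical version tag for the business rules


def surrogate_key(source_system: str, natural_key: str) -> str:
    """Derive a stable key from the natural key: no sequences, no clock, no randomness."""
    raw = f"{source_system}|{natural_key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def transform(record: dict) -> dict:
    """Pure function: the same input record always yields the same output row."""
    return {
        "order_sk": surrogate_key("shop", str(record["id"])),
        # Decimal avoids float rounding surprises when converting to cents.
        "amount_cents": int(Decimal(str(record["amount"])) * 100),
        # Deletes are kept as tombstones rather than removed outright.
        "is_deleted": bool(record.get("deleted", False)),
        "rules_version": RULES_VERSION,
    }
```

Truncating the hash to 16 hex characters keeps keys short but raises collision risk on very large tables, so the full digest may be the safer choice there.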
Stage 4: Load & Dedupe
Goal: Upsert keys; partitions; rerunnable jobs, where rerunning with the same batch ID produces the same outcome (idempotent load).
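A minimal sketch of an idempotent upsert, assuming a hypothetical `orders(id, amount, updated_at, batch_id)` target table in SQLite with `id` as the primary key. It relies on SQLite's `ON CONFLICT ... DO UPDATE` clause, which needs SQLite 3.24 or newer; other warehouses would use their own `MERGE` or upsert syntax.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, batch_id: str, rows: list[dict]) -> None:
    """Upsert rows keyed on id; rerunning the same batch leaves the table unchanged."""
    # Dedupe within the batch: keep the latest version of each key.
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["updated_at"] >= current["updated_at"]:
            latest[row["id"]] = row

    with conn:  # single transaction: a failed load rolls back instead of half-applying
        conn.executemany(
            """
            INSERT INTO orders (id, amount, updated_at, batch_id)
            VALUES (:id, :amount, :updated_at, :batch_id)
            ON CONFLICT(id) DO UPDATE SET
                amount = excluded.amount,
                updated_at = excluded.updated_at,
                batch_id = excluded.batch_id
            WHERE excluded.updated_at >= orders.updated_at
            """,
            [{**row, "batch_id": batch_id} for row in latest.values()],
        )
```

The `WHERE` guard on the update means an older batch replayed after a newer one does not overwrite fresher data, which matters during backfills.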
Stage 5: Validation
Goal: Row counts, checksums, key uniqueness, referential checks; alert on threshold breaches.
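These checks can be sketched as a single function that returns a list of failures for the alerting layer to act on. The function names, parameters, and the drift threshold below are illustrative assumptions. Rows are plain dicts here, whereas a real pipeline would more likely push these checks down into SQL.

```python
import hashlib


def table_checksum(rows, columns):
    """Order-independent checksum: hash each row, then hash the sorted digests."""
    digests = sorted(
        hashlib.sha256("|".join(str(r[c]) for c in columns).encode("utf-8")).hexdigest()
        for r in rows
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()


def validate(source, target, key, columns, fk=None, parent_keys=None, max_count_drift=0.0):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []

    # Row counts, compared against a tolerated relative drift.
    drift = abs(len(source) - len(target)) / max(len(source), 1)
    if drift > max_count_drift:
        failures.append(f"row count drift {drift:.2%} exceeds {max_count_drift:.2%}")

    # Key uniqueness in the target.
    keys = [r[key] for r in target]
    if len(keys) != len(set(keys)):
        failures.append("duplicate keys in target")

    # Content checksum over the compared columns.
    if table_checksum(source, columns) != table_checksum(target, columns):
        failures.append("checksum mismatch")

    # Referential check: every foreign key must exist in the parent key set.
    if fk is not None and parent_keys is not None:
        orphans = {r[fk] for r in target} - set(parent_keys)
        if orphans:
            failures.append(f"{len(orphans)} orphaned {fk} value(s)")

    return failures
```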
Stage 6: Operations & Backfill
Goal: Replay by date range; monitor lag; dead-letter or quarantine bad rows with reason codes.
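A small sketch of two of these pieces: enumerating daily partitions for a date-range replay, and splitting out bad rows with a reason code rather than failing the whole batch. The reason codes and the row shape are hypothetical.

```python
from datetime import date, timedelta


def date_partitions(start: date, end: date):
    """Yield each day in [start, end] so a backfill can replay one partition at a time."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def split_rows(rows):
    """Separate loadable rows from quarantined ones, attaching a reason code to each reject."""
    good, quarantined = [], []
    for row in rows:
        if row.get("id") is None:
            quarantined.append({"row": row, "reason": "MISSING_KEY"})
        elif not isinstance(row.get("amount"), (int, float)):
            quarantined.append({"row": row, "reason": "BAD_AMOUNT"})
        else:
            good.append(row)
    return good, quarantined
```

Replaying partition by partition pairs naturally with an idempotent load, since each day can be rerun safely if a backfill is interrupted.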
Final Review Checklist
- [ ] Source contract and keys documented
- [ ] Extract mode matches SLA and source constraints
- [ ] Transforms deterministic and versioned
- [ ] Idempotent load strategy
- [ ] Validation and reconciliation defined
Tips for Effective Guidance
- Plan for late-arriving facts and slowly changing dimensions in analytics paths.
- Pair with data-pipelines for orchestration and monitoring.
Handling Deviations
- Near-real-time: document micro-batch or streaming semantics separately.