Data Pipelines
Pipelines fail on silent schema drift, partial writes, and unclear ownership. Design for at-least-once delivery, idempotent sinks, and observable stages.
When to Offer This Workflow
Trigger conditions:
- - Batch or streaming ingestion (Kafka, Fivetran, Airflow, Dagster, Spark, etc.)
- Late data, backfills, or schema changes breaking jobs
- SLA misses on freshness or row counts
Initial offer:
Use six stages: (1) requirements & SLAs, (2) source contracts, (3) transforms & idempotency, (4) orchestration & dependencies, (5) quality & monitoring, (6) lineage & operations). Confirm batch vs stream and cloud stack.
Stage 1: Requirements & SLAs
Goal: Freshness (latency), completeness expectations, cost ceiling, failure tolerance (quarantine vs stop-the-line).
Exit condition: SLA table: pipeline → metric → threshold.
Stage 2: Source Contracts
Goal: Schema versioning; CDC vs snapshot pulls; API rate limits.
Practices
- - Raw landing zone immutable; curated layers downstream
Stage 3: Transforms & Idempotency
Goal: Deterministic transforms; upsert keys; partition strategy for rewinds.
Practices
- - Watermark progress for incremental loads
Stage 4: Orchestration & Dependencies
Goal: Clear DAG; retry policy; backfill without double counting; SLA miss alerts.
Stage 5: Quality & Monitoring
Goal: Data quality checks (null spikes, row bounds, referential checks); metrics on lag, duration, error rate.
Stage 6: Lineage & Operations
Goal: Column-level lineage where valuable; on-call runbook; ownership per pipeline.
Final Review Checklist
- - [ ] SLAs and failure policy explicit
- [ ] Source contracts and schema evolution path
- [ ] Idempotent writes and checkpointing
- [ ] Orchestration with retries and safe backfill
- [ ] Data quality checks and alerts
- [ ] Lineage and ownership documented
Tips for Effective Guidance
- - Separate compute from storage cost awareness for large shuffles.
- Pair with etl-design for batch patterns and message-queues for streaming handoffs.
Handling Deviations
- - Single-script pipelines: still document inputs, outputs, and schedule.
数据管道
管道会在静默模式漂移、部分写入和所有权不明确时失败。应设计为至少一次交付、幂等接收端和可观测的阶段。
何时提供此工作流
触发条件:
- - 批量或流式数据摄入(Kafka、Fivetran、Airflow、Dagster、Spark 等)
- 延迟数据、回填或导致任务失败的 schema 变更
- 数据新鲜度或行数未达到 SLA 要求
初始建议:
使用 六个阶段:(1) 需求与 SLA,(2) 源端契约,(3) 转换与幂等性,(4) 编排与依赖,(5) 质量与监控,(6) 血缘与运维。确认批处理与流处理模式及云技术栈。
阶段 1:需求与 SLA
目标: 数据新鲜度(延迟)、完整性预期、成本上限、故障容忍策略(隔离 vs 停止流水线)。
退出条件: SLA 表:管道 → 指标 → 阈值。
阶段 2:源端契约
目标: Schema 版本管理;CDC 与快照拉取;API 速率限制。
实践
阶段 3:转换与幂等性
目标: 确定性转换;更新键;支持回滚的分区策略。
实践
阶段 4:编排与依赖
目标: 清晰的 DAG;重试策略;无重复计数的回填;SLA 未达标告警。
阶段 5:质量与监控
目标: 数据质量检查(空值激增、行数边界、参照完整性检查);延迟、持续时间、错误率指标。
阶段 6:血缘与运维
目标: 有价值的列级血缘;值班手册;每个管道的所有权归属。
最终审查清单
- - [ ] 明确 SLA 和故障策略
- [ ] 源端契约和 schema 演进路径
- [ ] 幂等写入和检查点机制
- [ ] 带重试和安全回填的编排
- [ ] 数据质量检查和告警
- [ ] 记录血缘和所有权
有效指导技巧
- - 针对大规模数据混洗,将计算成本与存储成本分开考量。
- 配合 etl-design 处理批处理模式,配合 message-queues 处理流式传输交接。
处理偏差情况