Senior Data Engineer
Production-grade data engineering skill for building scalable, reliable data systems.
Table of Contents
- Trigger Phrases
- Quick Start
- Workflows
  - Building a Batch ETL Pipeline
  - Implementing Real-Time Streaming
  - Data Quality Framework Setup
- Architecture Decision Framework
- Tech Stack
- Reference Documentation
- Troubleshooting
Trigger Phrases
Activate this skill when you see:
Pipeline Design:
- "Design a data pipeline for..."
- "Build an ETL/ELT process..."
- "How should I ingest data from..."
- "Set up data extraction from..."
Architecture:
- "Should I use batch or streaming?"
- "Lambda vs Kappa architecture"
- "How to handle late-arriving data"
- "Design a data lakehouse"
Data Modeling:
- "Create a dimensional model..."
- "Star schema vs snowflake"
- "Implement slowly changing dimensions"
- "Design a data vault"
Data Quality:
- "Add data validation to..."
- "Set up data quality checks"
- "Monitor data freshness"
- "Implement data contracts"
Performance:
- "Optimize this Spark job"
- "Query is running slow"
- "Reduce pipeline execution time"
- "Tune Airflow DAG"
Quick Start
Core Tools
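```bash
# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend
```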
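To make the three check names commonly used in data quality validation (freshness, completeness, uniqueness) concrete, here is a minimal sketch over in-memory rows. It does not reflect how the skill's own validator script is implemented; the function names, the row layout and the thresholds are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(rows, ts_field, max_age, now=None):
    """Pass if the newest timestamp is no older than max_age."""
    if not rows:
        return False  # assumption: an empty dataset counts as stale
    now = now or datetime.now(timezone.utc)
    newest = max(row[ts_field] for row in rows)
    return now - newest <= max_age


def check_completeness(rows, field, min_ratio=1.0):
    """Pass if at least min_ratio of rows carry a non-null value for field."""
    if not rows:
        return False
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows) >= min_ratio


def check_uniqueness(rows, key):
    """Pass if no two rows share the same value for key."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


# Hypothetical sample data; field names are illustrative only.
now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
sample_rows = [
    {"order_id": 1, "amount": 10.0, "loaded_at": now - timedelta(minutes=60)},
    {"order_id": 2, "amount": None, "loaded_at": now - timedelta(minutes=30)},
]
report = {
    "freshness": check_freshness(sample_rows, "loaded_at", timedelta(hours=2), now=now),
    "completeness": check_completeness(sample_rows, "amount", min_ratio=0.9),
    "uniqueness": check_uniqueness(sample_rows, "order_id"),
}
print(report)
```

With this sample, freshness and uniqueness pass, while completeness fails because one of the two `amount` values is null.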
Workflows
→ See references/workflows.md for details
Architecture Decision Framework
Use this framework to choose the right approach for your data pipeline.
Batch vs Streaming
| Criteria | Batch | Streaming |
|---|---|---|
| Latency requirement | Hours to days | Seconds to minutes |
| Data volume | Large historical datasets | Continuous event streams |
| Processing complexity | Complex transformations, ML | Simple aggregations, filtering |
| Cost sensitivity | More cost-effective | Higher infrastructure cost |
| Error handling | Easier to reprocess | Requires careful design |
Decision Tree:
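```
Is real-time insight needed?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is daily data volume > 1TB?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
```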
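The batch-versus-streaming choice can also be written as a small function, which is handy for recording the decision in a design doc or a code review. This is only a sketch: the function and parameter names are invented here, and the 1 TB daily-volume cutoff and the returned stack names follow this skill's decision tree rather than any universal rule.

```python
def recommend_stack(needs_realtime, needs_exactly_once=False, daily_volume_tb=0.0):
    """Walk the batch-vs-streaming decision tree and return a suggested stack."""
    if needs_realtime:
        # Streaming branch: the delivery guarantee decides the processing engine.
        if needs_exactly_once:
            return "Kafka + Flink/Spark Structured Streaming"
        return "Kafka + consumer groups"
    # Batch branch: daily volume decides between a cluster engine and warehouse SQL.
    if daily_volume_tb > 1.0:
        return "Spark/Databricks"
    return "dbt + warehouse compute"


print(recommend_stack(needs_realtime=True, needs_exactly_once=True))
print(recommend_stack(needs_realtime=False, daily_volume_tb=0.2))
```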
Lambda vs Kappa Architecture
| Aspect | Lambda | Kappa |
|---|---|---|
| Complexity | Two codebases (batch + stream) | Single codebase |
| Maintenance | Higher (sync batch/stream logic) | Lower |
| Reprocessing | Native batch layer | Replay from source |
| Use case | ML training + real-time serving | Pure event-driven |
When to choose Lambda:
- Need to train ML models on historical data
- Complex batch transformations not feasible in streaming
- Existing batch infrastructure
When to choose Kappa:
- Event-sourced architecture
- All processing can be expressed as stream operations
- Starting fresh without legacy systems
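The "replay from source" idea behind Kappa can be illustrated without any streaming framework. In the sketch below, a plain Python list stands in for a retained event log (for example a Kafka topic), and the same `apply_event` function serves both live processing and reprocessing. The event shape and all function names are assumptions made for this example.

```python
from collections import defaultdict


def apply_event(totals, event):
    """Single piece of processing logic shared by live processing and replay."""
    totals[event["user"]] += event["amount"]


def replay(log, from_offset=0):
    """Rebuild state by re-reading the retained log from a given offset."""
    totals = defaultdict(float)
    for event in log[from_offset:]:
        apply_event(totals, event)
    return dict(totals)


# Hypothetical retained log; a list index plays the role of an offset.
event_log = [
    {"user": "a", "amount": 5.0},
    {"user": "b", "amount": 2.5},
    {"user": "a", "amount": 1.0},
]

# Live path: events are applied one by one as they arrive.
live_totals = defaultdict(float)
for event in event_log:
    apply_event(live_totals, event)

# Reprocessing path: the same logic, replayed from the start of the log.
print(replay(event_log))
```

In a Lambda setup, the same aggregation would typically exist twice, once in the batch layer and once in the streaming layer, and the two versions would have to be kept in sync.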
Data Warehouse vs Data Lakehouse
| Feature | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---|---|---|
| Best for | BI, SQL analytics | ML, unstructured data |
| Storage cost | Higher (proprietary format) | Lower (open formats) |
| Flexibility | Schema-on-write | Schema-on-read |
| Performance | Excellent for SQL | Good, improving |
| Ecosystem | Mature BI tools | Growing ML tooling |
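The schema-on-write versus schema-on-read distinction can be made concrete with a toy example. The sketch below is framework-free Python, not Snowflake, BigQuery, Delta or Iceberg code; `SCHEMA`, `write_with_schema` and `read_with_schema` are hypothetical names. In practice the line is blurrier than a two-column comparison suggests, since lakehouse table formats such as Delta Lake can also enforce a schema at write time.

```python
SCHEMA = {"id": int, "price": float}


def write_with_schema(table, record):
    """Schema-on-write: reject a record at load time if it violates the schema."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"{field!r} must be {ftype.__name__}")
    table.append(record)


def read_with_schema(raw_records):
    """Schema-on-read: store anything, coerce to the schema only when querying."""
    for record in raw_records:
        try:
            yield {field: ftype(record[field]) for field, ftype in SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            continue  # skip records that cannot be interpreted under this schema


# Write path: the bad record never reaches the table.
warehouse_table = []
write_with_schema(warehouse_table, {"id": 1, "price": 9.5})
try:
    write_with_schema(warehouse_table, {"id": "2", "price": "n/a"})
    rejected = False
except ValueError:
    rejected = True

# Read path: everything was stored as-is, problems surface at query time.
lake_records = [{"id": "1", "price": "9.5"}, {"id": "oops"}, {"id": 3, "price": 4}]
parsed = list(read_with_schema(lake_records))
print(rejected, parsed)
```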
Tech Stack
| Category | Technologies |
|---|---|
| Languages | Python, SQL, Scala |
| Orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, Flink |
| Streaming | Kafka, Kinesis, Pub/Sub |
| Storage | S3, GCS, Delta Lake, Iceberg |
| Warehouses | Snowflake, BigQuery, Redshift, Databricks |
| Quality | Great Expectations, dbt tests, Monte Carlo |
| Monitoring | Prometheus, Grafana, Datadog |
Reference Documentation
1. Data Pipeline Architecture
See references/data_pipeline_architecture.md for:
- Lambda vs Kappa architecture patterns
- Batch processing with Spark and Airflow
- Stream processing with Kafka and Flink
- Exactly-once semantics implementation
- Error handling and dead letter queues
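As a rough illustration of the dead letter queue pattern listed above, the following sketch retries a failing handler a bounded number of times and then parks the message instead of blocking the pipeline. It uses plain lists in place of a real broker; `process_with_dlq` and the layout of a dead letter record are assumptions for this example, not taken from the referenced document.

```python
def process_with_dlq(messages, handler, max_attempts=3):
    """Run handler on each message; park messages that keep failing in a dead letter list."""
    processed, dead_letters = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(message))
                break  # success, move on to the next message
            except Exception as exc:
                if attempt == max_attempts:
                    # Retries exhausted: keep the payload and the error for later inspection.
                    dead_letters.append(
                        {"message": message, "error": repr(exc), "attempts": attempt}
                    )
    return processed, dead_letters


# Hypothetical input: "x" cannot be parsed as an integer and ends up dead-lettered.
processed, dead_letters = process_with_dlq(["1", "x", "3"], handler=int)
print(processed)
print(dead_letters)
```

A production version would usually add a delay between attempts and write dead letters to a separate topic or table so they can be replayed after a fix.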
2. Data Modeling Patterns
See references/data_modeling_patterns.md for:
- Dimensional modeling (Star/Snowflake)
- Slowly Changing Dimensions (SCD Types 1-6)
- Data Vault modeling
- dbt best practices
- Partitioning and clustering
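Of the slowly changing dimension types listed above, Type 2 (keep full history by versioning rows) is a common one to implement. A minimal in-memory sketch follows; the row layout (`key`, `attrs`, `valid_from`, `valid_to`) and the function name are assumptions, and a real implementation would usually be a `MERGE` statement or a dbt snapshot rather than Python lists.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Close the current row for key if its attributes changed, then append a new current row."""
    current = next(
        (row for row in dimension if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # nothing changed, so no new version is created
        current["valid_to"] = effective  # end-date the previous version
    dimension.append(
        {"key": key, "attrs": dict(new_attrs), "valid_from": effective, "valid_to": None}
    )
    return dimension


# Hypothetical customer dimension with one attribute change and one no-op update.
customer_dim = []
apply_scd2(customer_dim, 42, {"city": "Oslo"}, date(2024, 1, 1))
apply_scd2(customer_dim, 42, {"city": "Bergen"}, date(2024, 6, 1))
apply_scd2(customer_dim, 42, {"city": "Bergen"}, date(2024, 7, 1))
for row in customer_dim:
    print(row)
```

After these three calls the dimension holds two rows for key 42: the closed Oslo version and the open Bergen version.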
3. DataOps Best Practices
See references/dataops_best_practices.md for:
- Data testing frameworks
- Data contracts and schema validation
- CI/CD for data pipelines
- Observability and lineage
- Incident response
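A data contract can be as simple as a declared set of fields, types and nullability that producers agree to honour. The sketch below checks one record against such a contract and returns human-readable violations, which is the kind of check that can run in CI before a pipeline change is merged. The contract format and the `validate_record` function are invented for illustration.

```python
CONTRACT = {
    "order_id": {"type": int, "nullable": False},
    "customer_email": {"type": str, "nullable": True},
    "amount": {"type": float, "nullable": False},
}


def validate_record(record, contract=CONTRACT):
    """Return a list of contract violations for one record (empty list means valid)."""
    violations = []
    for field, rule in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not rule["nullable"]:
                violations.append(f"null not allowed: {field}")
            continue
        if not isinstance(value, rule["type"]):
            violations.append(f"wrong type for {field}: expected {rule['type'].__name__}")
    return violations


good = validate_record({"order_id": 7, "customer_email": None, "amount": 19.99})
bad = validate_record({"order_id": "7", "amount": None})
print(good)
print(bad)
```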
Troubleshooting
→ See references/troubleshooting.md for details
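Senior Data Engineer
Production-grade data engineering skill for building scalable, reliable data systems.
Table of Contents
- Trigger Phrases
- Quick Start
- Workflows
  - Building a Batch ETL Pipeline
  - Implementing Real-Time Streaming
  - Data Quality Framework Setup
- Architecture Decision Framework
- Tech Stack
- Reference Documentation
- Troubleshooting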
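Trigger Phrases
Activate this skill when you see:
Pipeline Design:
- "Design a data pipeline for..."
- "Build an ETL/ELT process..."
- "How should I ingest data from..."
- "Set up data extraction from..."
Architecture:
- "Should I use batch or streaming?"
- "Lambda vs Kappa architecture"
- "How to handle late-arriving data"
- "Design a data lakehouse"
Data Modeling:
- "Create a dimensional model..."
- "Star schema vs snowflake"
- "Implement slowly changing dimensions"
- "Design a data vault"
Data Quality:
- "Add data validation to..."
- "Set up data quality checks"
- "Monitor data freshness"
- "Implement data contracts"
Performance:
- "Optimize this Spark job"
- "Query is running slow"
- "Reduce pipeline execution time"
- "Tune Airflow DAG"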
Quick Start
Core Tools
```bash
# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend
```
Workflows
→ See references/workflows.md for details
Architecture Decision Framework
Use this framework to choose the right approach for your data pipeline.
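Batch vs Streaming
| Criteria | Batch | Streaming |
|---|---|---|
| Latency requirement | Hours to days | Seconds to minutes |
| Data volume | Large historical datasets | Continuous event streams |
| Processing complexity | Complex transformations, ML | Simple aggregations, filtering |
| Cost sensitivity | More cost-effective | Higher infrastructure cost |
| Error handling | Easier to reprocess | Requires careful design |
Decision Tree:
```
Is real-time insight needed?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is daily data volume > 1TB?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
```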
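Lambda vs Kappa Architecture
| Aspect | Lambda | Kappa |
|---|---|---|
| Complexity | Two codebases (batch + stream) | Single codebase |
| Maintenance | Higher (sync batch/stream logic) | Lower |
| Reprocessing | Native batch layer | Replay from source |
| Use case | ML training + real-time serving | Pure event-driven |
When to choose Lambda:
- Need to train ML models on historical data
- Complex batch transformations not feasible in streaming
- Existing batch infrastructure
When to choose Kappa:
- Event-sourced architecture
- All processing can be expressed as stream operations
- Starting fresh without legacy systems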
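Data Warehouse vs Data Lakehouse
| Feature | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---|---|---|
| Best for | BI, SQL analytics | ML, unstructured data |
| Storage cost | Higher (proprietary format) | Lower (open formats) |
| Flexibility | Schema-on-write | Schema-on-read |
| Performance | Excellent for SQL | Good, improving |
| Ecosystem | Mature BI tools | Growing ML tooling |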
Tech Stack
| Category | Technologies |
|---|---|
| Languages | Python, SQL, Scala |
| Orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, Flink |
| Streaming | Kafka, Kinesis, Pub/Sub |
| Storage | S3, GCS, Delta Lake, Iceberg |
| Warehouses | Snowflake, BigQuery, Redshift, Databricks |
| Quality | Great Expectations, dbt tests, Monte Carlo |
| Monitoring | Prometheus, Grafana, Datadog |
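Reference Documentation
1. Data Pipeline Architecture
See references/data_pipeline_architecture.md for:
- Lambda vs Kappa architecture patterns
- Batch processing with Spark and Airflow
- Stream processing with Kafka and Flink
- Exactly-once semantics implementation
- Error handling and dead letter queues
2. Data Modeling Patterns
See references/data_modeling_patterns.md for:
- Dimensional modeling (Star/Snowflake)
- Slowly Changing Dimensions (SCD Types 1-6)
- Data Vault modeling
- dbt best practices
- Partitioning and clustering
3. DataOps Best Practices
See references/dataops_best_practices.md for:
- Data testing frameworks
- Data contracts and schema validation
- CI/CD for data pipelines
- Observability and lineage
- Incident response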
Troubleshooting
→ See references/troubleshooting.md for details