Senior Data Engineer

Production-grade data engineering skill for building scalable, reliable data systems.

1. Trigger Phrases
2. Quick Start
3. Workflows
   - Building a Batch ETL Pipeline
   - Implementing Real-Time Streaming
   - Data Quality Framework Setup
4. Architecture Decision Framework
5. Tech Stack
6. Reference Documentation
7. Troubleshooting

Trigger Phrases

Activate this skill when you see:

Pipeline Design:

- "Design a data pipeline for..."
- "Build an ETL/ELT process..."
- "How should I ingest data from..."
- "Set up data extraction from..."

Architecture:

- "Should I use batch or streaming?"
- "Lambda vs Kappa architecture"
- "How to handle late-arriving data"
- "Design a data lakehouse"

Data Modeling:

- "Create a dimensional model..."
- "Star schema vs snowflake"
- "Implement slowly changing dimensions"
- "Design a data vault"

Data Quality:

- "Add data validation to..."
- "Set up data quality checks"
- "Monitor data freshness"
- "Implement data contracts"

Performance:

- "Optimize this Spark job"
- "Query is running slow"
- "Reduce pipeline execution time"
- "Tune Airflow DAG"

Quick Start

Core Tools

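# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend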
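To make the three checks this skill's validator is invoked with (freshness, completeness, uniqueness) concrete, here is a minimal standard-library sketch. It is not the validator script's implementation; the function names, the row-as-dict layout, and the sample `sales` rows are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_completeness(rows, required):
    """Fraction of rows in which every required column is non-null."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(c) is not None for c in required) for r in rows)
    return ok / len(rows)


def check_uniqueness(rows, key):
    """True when no two rows share the same value for `key`."""
    values = [r[key] for r in rows]
    return len(values) == len(set(values))


def check_freshness(rows, ts_column, max_age, now=None):
    """True when the newest timestamp is no older than `max_age`."""
    now = now or datetime.now(timezone.utc)
    newest = max(r[ts_column] for r in rows)
    return now - newest <= max_age


# Hypothetical sample data; a fixed "now" keeps the result reproducible.
now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
sales = [
    {"order_id": 1, "amount": 10.0, "updated_at": now - timedelta(hours=2)},
    {"order_id": 2, "amount": None, "updated_at": now - timedelta(hours=30)},
]

report = {
    "completeness": check_completeness(sales, ["order_id", "amount"]),
    "uniqueness": check_uniqueness(sales, "order_id"),
    "freshness": check_freshness(sales, "updated_at", timedelta(hours=24), now=now),
}
print(report)
```

In a real pipeline each result would typically be compared against a threshold (for example, completeness of at least 0.99) and the run failed or flagged when a check falls short.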

Workflows

→ See references/workflows.md for details

Architecture Decision Framework

Use this framework to choose the right approach for your data pipeline.

Batch vs Streaming

Criteria	Batch	Streaming
Latency requirement	Hours to days	Seconds to minutes
Data volume

Decision Tree:
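
Need real-time insights?
├── Yes → Use streaming
│   └── Need exactly-once semantics?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Daily data volume over 1 TB?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute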
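
The decision tree asks three questions in order: is real-time insight needed, are exactly-once semantics needed (streaming branch), and does daily volume exceed 1 TB (batch branch). A rough sketch of that logic as a function follows; the function name and parameter names are hypothetical, and the returned strings simply repeat the tree's leaves.

```python
def recommend_stack(
    needs_realtime: bool,
    needs_exactly_once: bool = False,
    daily_volume_tb: float = 0.0,
) -> str:
    """Walk the batch-vs-streaming decision tree and return its leaf."""
    if needs_realtime:
        # Streaming branch: the only further question is delivery semantics.
        if needs_exactly_once:
            return "Streaming: Kafka + Flink/Spark Structured Streaming"
        return "Streaming: Kafka + consumer groups"
    # Batch branch: the only further question is daily data volume.
    if daily_volume_tb > 1.0:
        return "Batch: Spark/Databricks"
    return "Batch: dbt + warehouse compute"


print(recommend_stack(needs_realtime=False, daily_volume_tb=0.2))
```

The 1 TB cutoff comes from the tree itself; treat it as a starting heuristic rather than a hard limit.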

Lambda vs Kappa Architecture

Aspect	Lambda	Kappa
Complexity	Two codebases (batch + stream)	Single codebase
Maintenance

When to choose Lambda:

- Need to train ML models on historical data
- Complex batch transformations not feasible in streaming
- Existing batch infrastructure

When to choose Kappa:

- Event-sourced architecture
- All processing can be expressed as stream operations
- Starting fresh without legacy systems
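
The two checklists above can be turned into a simple tally, sketched below. The `Workload` fields mirror the six bullets, but the equal-weight scoring and the "undecided" tie result are assumptions added for illustration, not part of the framework.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Signals from "When to choose Lambda"
    trains_ml_on_history: bool = False
    has_batch_only_transforms: bool = False
    has_existing_batch_infra: bool = False
    # Signals from "When to choose Kappa"
    is_event_sourced: bool = False
    all_logic_streamable: bool = False
    is_greenfield: bool = False


def suggest_architecture(w: Workload) -> str:
    """Count the signals on each side; equal weighting is an assumption."""
    lambda_score = sum(
        [w.trains_ml_on_history, w.has_batch_only_transforms, w.has_existing_batch_infra]
    )
    kappa_score = sum([w.is_event_sourced, w.all_logic_streamable, w.is_greenfield])
    if lambda_score > kappa_score:
        return "lambda"
    if kappa_score > lambda_score:
        return "kappa"
    return "undecided"


print(suggest_architecture(Workload(is_event_sourced=True, is_greenfield=True)))
```

A tie is a cue to weigh the criteria by hand, since in practice a single hard constraint (such as a transformation that cannot run in streaming) may outweigh several softer signals.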

Data Warehouse vs Data Lakehouse

Feature	Warehouse (Snowflake/BigQuery)	Lakehouse (Delta/Iceberg)
Best for	BI, SQL analytics	ML, unstructured data
Storage cost

Tech Stack

Category	Technologies
Languages	Python, SQL, Scala
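Orchestration	Airflow, Prefect, Dagster
Transformation	dbt, Spark, Flink
Streaming	Kafka, Kinesis, Pub/Sub
Storage	S3, GCS, Delta Lake, Iceberg
Warehouse	Snowflake, BigQuery, Redshift, Databricks
Quality	Great Expectations, dbt tests, Monte Carlo
Monitoring	Prometheus, Grafana, Datadog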

Reference Documentation

1. Data Pipeline Architecture

See references/data_pipeline_architecture.md for:

- Lambda vs Kappa architecture patterns
- Batch processing with Spark and Airflow
- Stream processing with Kafka and Flink
- Exactly-once semantics implementation
- Error handling and dead letter queues

2. Data Modeling Patterns

See references/data_modeling_patterns.md for:

- Dimensional modeling (Star/Snowflake)
- Slowly Changing Dimensions (SCD Types 1-6)
- Data Vault modeling
- dbt best practices
- Partitioning and clustering

3. DataOps Best Practices

See references/dataops_best_practices.md for:

- Data testing frameworks
- Data contracts and schema validation
- CI/CD for data pipelines
- Observability and lineage
- Incident response

Troubleshooting

→ See references/troubleshooting.md for details

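Senior Data Engineer

Production-grade data engineering skill for building scalable, reliable data systems.

Trigger Phrases

Activate this skill when you see:

Pipeline Design:

- "Design a data pipeline for..."
- "Build an ETL/ELT process..."
- "How should I ingest data from..."
- "Set up data extraction from..."

Architecture:

- "Should I use batch or streaming?"
- "Lambda vs Kappa architecture"
- "How to handle late-arriving data"
- "Design a data lakehouse"

Data Modeling:

- "Create a dimensional model..."
- "Star schema vs snowflake"
- "Implement slowly changing dimensions"
- "Design a data vault"

Data Quality:

- "Add data validation to..."
- "Set up data quality checks"
- "Monitor data freshness"
- "Implement data contracts"

Performance:

- "Optimize this Spark job"
- "Query is running slow"
- "Reduce pipeline execution time"
- "Tune Airflow DAG"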

Quick Start

Core Tools

# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend

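Workflows

→ See references/workflows.md for details

Architecture Decision Framework

Use this framework to choose the right approach for your data pipeline.

Batch vs Streaming

Criteria	Batch	Streaming
Latency requirement	Hours to days	Seconds to minutes
Data volume

Decision Tree:

Need real-time insights?
├── Yes → Use streaming
│   └── Need exactly-once semantics?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Daily data volume over 1 TB?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute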

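Lambda vs Kappa Architecture

Aspect	Lambda	Kappa
Complexity	Two codebases (batch + stream)	Single codebase
Maintenance

When to choose Lambda:

- Need to train ML models on historical data
- Complex batch transformations not feasible in streaming
- Existing batch infrastructure

When to choose Kappa:

- Event-sourced architecture
- All processing can be expressed as stream operations
- Starting fresh without legacy systems

Data Warehouse vs Data Lakehouse

Feature	Warehouse (Snowflake/BigQuery)	Lakehouse (Delta/Iceberg)
Best for	BI, SQL analytics	ML, unstructured data
Storage cost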

Tech Stack

Category	Technologies
Languages	Python, SQL, Scala
Orchestration	Airflow, Prefect, Dagster
Transformation	dbt, Spark, Flink
Streaming	Kafka, Kinesis, Pub/Sub
Storage	S3, GCS, Delta Lake, Iceberg
Warehouse	Snowflake, BigQuery, Redshift, Databricks
Quality	Great Expectations, dbt tests, Monte Carlo
Monitoring	Prometheus, Grafana, Datadog

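Reference Documentation

1. Data Pipeline Architecture

See references/data_pipeline_architecture.md for:

- Lambda vs Kappa architecture patterns
- Batch processing with Spark and Airflow
- Stream processing with Kafka and Flink
- Exactly-once semantics implementation
- Error handling and dead letter queues

2. Data Modeling Patterns

See references/data_modeling_patterns.md for:

- Dimensional modeling (Star/Snowflake)
- Slowly Changing Dimensions (SCD Types 1-6)
- Data Vault modeling
- dbt best practices
- Partitioning and clustering

3. DataOps Best Practices

See references/dataops_best_practices.md for:

- Data testing frameworks
- Data contracts and schema validation
- CI/CD for data pipelines
- Observability and lineage
- Incident response

Troubleshooting

→ See references/troubleshooting.md for details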
