Data Model
Analytics models succeed when grain is explicit, keys are stable, and slowly changing dimensions are chosen deliberately—not “star schema by default.”
When to Offer This Workflow
Trigger conditions:
- - Designing a warehouse, lakehouse, or BI layer
- Confusion on one row per what; duplicate counts in reports
- Refactoring dimensional models for performance or clarity
Initial offer:
Use six stages: (1) business questions & grain, (2) conformed dimensions, (3) facts & measures, (4) dimensions & SCD types, (5) keys & integrity, (6) performance & evolution). Confirm tooling (dbt, dimensional DW, BigQuery, etc.).
Stage 1: Business Questions & Grain
Goal: Grain = the atomic row: e.g., “one line item per order per day” not “sort of per order.”
Practices
- - List questions the model must answer; derive grain from smallest needed detail
Exit condition: One sentence grain per fact table.
Stage 2: Conformed Dimensions
Goal: Same customer/product definitions across facts—shared dimension tables or SCD policy aligned.
Stage 3: Facts & Measures
Goal: Additive vs semi-additive vs non-additive measures documented (balances, distinct counts).
Practices
- - Degenerate dimensions vs junk dimensions—avoid wide fact sprawl without reason
Stage 4: Dimensions & SCD Types
Goal: SCD1 overwrite vs SCD2 history with valid_from/valid_to vs SCD3 limited history—match compliance and reporting needs.
Stage 5: Keys & Integrity
Goal: Surrogate keys in facts; natural keys preserved as attributes; referential integrity strategy in the warehouse layer.
Stage 6: Performance & Evolution
Goal: Partition and cluster keys for large facts; late-arriving facts policy; version dims when schema evolves.
Final Review Checklist
- - [ ] Grain explicit per fact table
- [ ] Conformed dimensions planned
- [ ] Measure additivity documented
- [ ] SCD strategy per critical dimension
- [ ] Keys and late-arriving data handled
Tips for Effective Guidance
- - Fan traps and chasm traps in BI—flag when joining across facts incorrectly.
- Snapshot fact tables for point-in-time balances vs transaction facts.
Handling Deviations
- - Event-only pipelines: still model curated dimensions for analysis, not only raw JSON.
数据模型
当粒度明确、键值稳定、缓慢变化维度经过审慎选择时,分析模型才能成功——而非默认使用星型模式。
何时提供此工作流
触发条件:
- - 设计数据仓库、湖仓一体或BI层时
- 对每行代表什么存在困惑;报表中出现重复计数
- 为提升性能或清晰度而重构维度模型
初始建议:
使用六个阶段:(1)业务问题与粒度,(2)一致性维度,(3)事实与度量,(4)维度与SCD类型,(5)键值与完整性,(6)性能与演进。确认工具链(dbt、维度数据仓库、BigQuery等)。
阶段1:业务问题与粒度
目标: 粒度=原子行:例如每天每订单的每个行项目,而非大致按订单。
实践
- - 列出模型必须回答的问题;从所需最小细节推导粒度
退出条件: 每个事实表一句粒度描述。
阶段2:一致性维度
目标: 跨事实表保持相同的客户/产品定义——共享维度表或SCD策略对齐。
阶段3:事实与度量
目标: 记录可加性 vs 半可加性 vs 不可加性度量(余额、去重计数)。
实践
- - 退化维度 vs 垃圾维度——避免 无理由 扩展 宽事实表
阶段4:维度与SCD类型
目标: SCD1覆盖 vs SCD2历史记录(使用validfrom/validto) vs SCD3有限历史——匹配 合规 与 报表 需求。
阶段5:键值与完整性
目标: 事实表中使用代理键;自然键作为属性保留;数据仓库层制定参照完整性策略。
阶段6:性能与演进
目标: 为大事实表设置分区和聚簇键;制定延迟到达事实策略;模式演进时版本化 维度。
最终审查清单
- - [ ] 每个事实表粒度明确
- [ ] 计划好一致性维度
- [ ] 记录度量可加性
- [ ] 每个关键维度制定SCD策略
- [ ] 处理键值与延迟到达数据
有效指导技巧
- - BI中的扇出陷阱和沟壑陷阱——在跨事实表错误连接时标记。
- 快照事实表用于时间点余额 vs 事务事实表。
处理偏差情况
- - 仅事件管道:仍需为分析建模精选 维度,而非仅原始JSON。