Quick Reference
| Topic | File | Key Trap |
|---|---|---|
| CI/CD and DAGs | `pipelines.md` | Coupling training/inference deps |
| Model serving | `serving.md` | Cold start with large models |
| Drift and alerts | `monitoring.md` | Only technical metrics |
| Versioning | `reproducibility.md` | Not versioning preprocessing |
| GPU infrastructure | `gpu.md` | GPU request = full device |
Critical Traps
Training-Serving Skew:
- Preprocessing in notebook ≠ preprocessing in service → silent bugs
- Pandas in notebook → memory leaks in production (use native types)
- Feature store values at training time ≠ serving time without proper joins
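
One way to avoid the notebook/service mismatch is to keep preprocessing in a single importable function that both paths call. The following is a minimal sketch under assumptions: the feature names, the scaling constants in `FEATURE_STATS`, and the helper names `training_rows` and `serve_one` are hypothetical and only illustrate the pattern of sharing one code path built on native Python types.

```python
import math
from typing import List, Mapping, Optional, Tuple

# Hypothetical features and (mean, std) pairs. In practice these would be
# fitted during training and stored alongside the model artifact.
FEATURE_STATS: Mapping[str, Tuple[float, float]] = {
    "age": (40.0, 12.0),
    "income": (52000.0, 18000.0),
}


def preprocess(raw: Mapping[str, Optional[float]]) -> List[float]:
    """Turn one raw record into a model-ready feature vector.

    Only built-in types are used, so the same function can run in a
    notebook, a batch training job, and an online service.
    """
    features = []
    for name, (mean, std) in FEATURE_STATS.items():
        value = raw.get(name)
        if value is None or math.isnan(value):
            value = mean  # impute missing values with the training mean
        features.append((value - mean) / std)
    return features


def training_rows(records):
    # Training path: a batch of records, one preprocess() call per record.
    return [preprocess(r) for r in records]


def serve_one(record):
    # Serving path: a single request going through the same function.
    return preprocess(record)
```

Because both paths call `preprocess`, a change to imputation or scaling cannot reach training without also reaching serving.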
GPU Memory:
- Requesting `nvidia.com/gpu: 1` reserves the ENTIRE GPU, not partial memory
- MIG/MPS sharing has real limitations (not plug-and-play)
- OOM on GPU kills pod with no useful logs
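
The whole-device rule can be checked before a manifest is applied. This is a rough sketch, assuming the pod spec has already been loaded into a Python dict (for example from YAML); `gpus_reserved` and the example `pod_spec` are hypothetical names, and `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin. Only regular containers are counted here.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # resource name exposed by the NVIDIA device plugin


def gpus_reserved(pod_spec: dict) -> int:
    """Count the whole GPUs a pod spec would reserve across its containers."""
    total = 0
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        raw = str(limits.get(GPU_RESOURCE, 0))
        if not raw.isdigit():
            # GPUs are requested in whole units, so a value like "0.5"
            # is treated as a mistake rather than a partial-memory request.
            raise ValueError(
                f"{container.get('name')}: GPU count must be a whole number, got {raw!r}"
            )
        total += int(raw)
    return total


# Hypothetical pod: one GPU-backed model server plus a CPU-only sidecar.
pod_spec = {
    "containers": [
        {"name": "model-server", "resources": {"limits": {GPU_RESOURCE: 1}}},
        {"name": "sidecar", "resources": {"limits": {"cpu": "500m"}}},
    ]
}
```

Calling `gpus_reserved(pod_spec)` on the example returns the number of full devices the scheduler would set aside, regardless of how little GPU memory the model actually uses.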
Model Versioning ≠ Code Versioning:
- Model artifacts need separate versioning (MLflow, W&B, DVC)
- Training data version + preprocessing version + code version = reproducibility
- Rollback requires keeping old model versions deployable
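
The "data + preprocessing + code" equation can be made concrete by recording all three versions next to each model artifact. The sketch below is illustrative only: `ModelLineage`, its field names, and the example version strings are assumptions, not part of MLflow, W&B, or DVC.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelLineage:
    """The three versions that together make a model reproducible."""

    data_version: str           # e.g. a dataset snapshot ID
    preprocessing_version: str  # version of the transform/feature code
    code_version: str           # e.g. a git commit SHA

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the hash is stable across runs.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical example values.
v1 = ModelLineage("data-2024-01", "prep-1.0", "abc1234")
v2 = ModelLineage("data-2024-01", "prep-1.1", "abc1234")

# Keeping old fingerprints mapped to their artifacts is what makes rollback possible.
registry = {v1.fingerprint(): "models/v1", v2.fingerprint(): "models/v2"}
```

Here `v1` and `v2` share data and code but differ in preprocessing, so they get different fingerprints; a tag based on the code commit alone would not tell them apart.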
Drift Detection Timing:
- Retraining trigger isn't just "drift > threshold" → cost/benefit matters
- Delayed ground truth makes concept drift detection lag weeks
- Upstream data pipeline changes cause drift without model issues
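
A retraining trigger that weighs cost against benefit might look like the sketch below. The Population Stability Index (PSI) is used here only as one common drift score; the text above does not prescribe a metric. The function names, the threshold, and the gain and cost figures are all hypothetical.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    Both inputs are per-bin proportions over the same bins.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against log(0) and division by zero
        total += (a - e) * math.log(a / e)
    return total


def should_retrain(
    drift_score: float,
    drift_threshold: float,
    expected_gain: float,
    retrain_cost: float,
) -> bool:
    """Retrain only when drift is material AND the expected gain exceeds the cost."""
    return drift_score > drift_threshold and expected_gain > retrain_cost


# Hypothetical bin proportions: training baseline vs. recent traffic.
baseline = [0.5, 0.5]
recent = [0.9, 0.1]
score = psi(baseline, recent)
```

With these inputs the drift score clears a threshold of 0.2, yet `should_retrain(score, 0.2, expected_gain=100.0, retrain_cost=500.0)` still returns `False` because retraining would cost more than it is expected to recover.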
Scope
This skill ONLY covers:
- CI/CD pipelines for models
- Model serving and scaling
- Monitoring and drift detection
- Reproducibility practices
- GPU infrastructure patterns
Does NOT cover: ML algorithms, feature engineering, hyperparameter tuning.