MLOps

Quick Reference

Topic	File	Key Trap
CI/CD and DAGs	pipelines.md	Coupling training/inference deps
Model serving

Critical Traps

Training-Serving Skew:

- Preprocessing in notebook ≠ preprocessing in service → silent bugs
- Pandas in notebook → memory leaks in production (use native types)
- Feature store values at training time ≠ values at serving time unless the joins are done properly
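One way to avoid the notebook/service mismatch is to keep the preprocessing in a single function that both paths import. The sketch below is only an illustration: the feature names (age, income), the scaling choices, and the function names are assumptions, not part of this skill.

```python
import math
from typing import List, Mapping


def preprocess(raw: Mapping[str, object]) -> List[float]:
    """Turn one raw record into a feature vector.

    Imported by both the training job and the serving handler, so the two
    paths cannot drift apart. Uses only native Python types (no pandas).
    """
    # Hypothetical features; missing or None values fall back to 0.0.
    age = float(raw.get("age") or 0.0)
    income = float(raw.get("income") or 0.0)
    return [age / 100.0, math.log1p(max(income, 0.0))]


def build_training_matrix(rows):
    # Training path: apply the same function to every historical row.
    return [preprocess(r) for r in rows]


def handle_request(payload):
    # Serving path: identical transformation for a single live request.
    return preprocess(payload)
```

Because both paths call preprocess, a unit test can assert that a training row and the equivalent request payload produce the same vector.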

GPU Memory:

- nvidia.com/gpu: 1 (set under the container's resources.limits) reserves the ENTIRE GPU, not partial memory
- MIG/MPS sharing has real limitations (not plug-and-play)
- OOM on GPU kills pod with no useful logs
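To make the whole-GPU point concrete, here is a minimal sketch that builds a Pod manifest as a Python dict. It assumes the NVIDIA device plugin is installed with its default configuration; the helper name gpu_pod_manifest, the pod name, and the image are hypothetical.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that asks for whole GPUs.

    nvidia.com/gpu is an extended resource: it is set under limits and
    counted in whole devices, so 1 means one entire GPU.
    """
    if gpus < 1:
        raise ValueError("GPUs are allocated in whole units; use >= 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical image name; kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(gpu_pod_manifest("trainer", "example.com/trainer:1.0"), indent=2))
```

There is no field here for "half a GPU" or "8 GB of GPU memory"; fractional sharing needs MIG or MPS to be configured separately, with the limitations noted above.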

Model Versioning ≠ Code Versioning:

- Model artifacts need separate versioning (MLflow, W&B, DVC)
- Training data version + preprocessing version + code version = reproducibility
- Rollback requires keeping old model versions deployable
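The "data + preprocessing + code" rule can be sketched as a small lineage record whose hash names the artifact. This is an illustration using only the standard library, not the API of MLflow, W&B, or DVC; the class, field, and function names are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelLineage:
    # All three fields are hypothetical identifiers chosen for illustration.
    data_version: str           # e.g. a dataset snapshot tag
    preprocessing_version: str  # version of the shared preprocessing code
    code_version: str           # e.g. the commit of the training code

    def fingerprint(self) -> str:
        """Stable hash over the three versions that define a training run."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


def artifact_name(model: str, lineage: ModelLineage) -> str:
    # Store each artifact under its fingerprint so old versions stay
    # addressable (and therefore deployable) for rollback.
    return f"{model}-{lineage.fingerprint()}"
```

Changing any one of the three versions yields a new fingerprint, so a rollback target is never overwritten by a later training run.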

Drift Detection Timing:

- Retraining trigger isn't just "drift > threshold" → cost/benefit matters
- Delayed ground truth makes concept drift detection lag weeks
- Upstream data pipeline changes cause drift without model issues
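A retraining trigger that weighs cost against benefit might look like the sketch below. The Population Stability Index is used here as one common drift score, chosen for illustration since the skill does not name a metric; expected_gain and retrain_cost are hypothetical estimates that would have to come from elsewhere.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    Both inputs are per-bin proportions over the same bins.
    """
    if len(expected) != len(actual):
        raise ValueError("distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def should_retrain(drift: float, threshold: float,
                   expected_gain: float, retrain_cost: float) -> bool:
    """Retrain only when drift is high AND the expected benefit beats the cost.

    expected_gain and retrain_cost are hypothetical estimates in the same
    unit; how to estimate them is outside this sketch.
    """
    return drift > threshold and expected_gain > retrain_cost
```

A score like this only sees input distributions. It cannot tell an upstream pipeline change from a real shift in the data, and it says nothing about concept drift until ground truth arrives.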

Scope

This skill ONLY covers:

- CI/CD pipelines for models
- Model serving and scaling
- Monitoring and drift detection
- Reproducibility practices
- GPU infrastructure patterns

Does NOT cover: ML algorithms, feature engineering, hyperparameter tuning.
