Senior ML Engineer
Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.
Model Deployment Workflow
Deploy a trained model to production with monitoring:
1. Export model to standardized format (ONNX, TorchScript, SavedModel)
2. Package model with dependencies in Docker container
3. Deploy to staging environment
4. Run integration tests against staging
5. Deploy canary (5% traffic) to production
6. Monitor latency and error rates for 1 hour
7. Promote to full production if metrics pass
8. Validation: p95 latency < 100ms, error rate < 0.1%
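The promotion decision in the last two steps can be expressed as a small gate. The following is a minimal sketch, assuming latency samples and request counts have already been collected from the canary; the function names `percentile` and `canary_passes` are illustrative and not part of any deployment tool.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def canary_passes(latencies_ms, error_count, request_count,
                  max_p95_ms=100.0, max_error_rate=0.001):
    """Return True when the canary meets both validation thresholds."""
    if request_count == 0:
        return False  # no traffic observed, so there is nothing to promote on
    p95 = percentile(latencies_ms, 95)
    error_rate = error_count / request_count
    return p95 < max_p95_ms and error_rate < max_error_rate
```

The defaults mirror the validation line (p95 under 100 ms, error rate under 0.1%). A real pipeline would likely read these values from its metrics backend rather than from in-memory lists.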
Container Template
CODEBLOCK0
Serving Options
| Option | Latency | Throughput | Use Case |
|---|---|---|---|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |
MLOps Pipeline Setup
Establish automated training and deployment:
1. Configure feature store (Feast, Tecton) for training data
2. Set up experiment tracking (MLflow, Weights & Biases)
3. Create training pipeline with hyperparameter logging
4. Register model in model registry with version metadata
5. Configure staging deployment triggered by registry events
6. Set up A/B testing infrastructure for model comparison
7. Enable drift monitoring with alerting
8. Validation: New models automatically evaluated against baseline
Feature Store Pattern
CODEBLOCK1
Retraining Triggers
| Trigger | Detection | Action |
|---|---|---|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |
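The trigger table can be read as a decision function. This is a rough sketch under stated assumptions: the `ModelHealth` fields, the accuracy floor of 0.90, the 10,000-sample limit, the 7-day schedule, and the priority order are all placeholders chosen for illustration, since the table leaves "threshold" and "X" open. Only the PSI limit of 0.2 comes from the table.

```python
from dataclasses import dataclass


@dataclass
class ModelHealth:
    accuracy: float          # latest evaluated accuracy
    psi: float               # population stability index of the inputs
    new_samples: int         # labeled samples collected since the last training run
    days_since_retrain: int


def retraining_action(health, accuracy_floor=0.90, psi_limit=0.2,
                      sample_limit=10_000, schedule_days=7):
    """Map monitoring signals to an action, checking the most urgent trigger first."""
    if health.accuracy < accuracy_floor:
        return "immediate_retrain"
    if health.psi > psi_limit:
        return "evaluate_then_retrain"
    if health.days_since_retrain >= schedule_days:
        return "full_retrain"
    if health.new_samples >= sample_limit:
        return "incremental_update"
    return "no_action"
```

Checking the performance drop before drift is a design choice: a measured accuracy loss is direct evidence of harm, whereas drift only suggests it.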
LLM Integration Workflow
Integrate LLM APIs into production applications:
1. Create provider abstraction layer for vendor flexibility
2. Implement retry logic with exponential backoff
3. Configure fallback to secondary provider
4. Set up token counting and context truncation
5. Add response caching for repeated queries
6. Implement cost tracking per request
7. Add structured output validation with Pydantic
8. Validation: Response parses correctly, cost within budget
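Retry, fallback, and caching from the steps above can be combined in one call path. The sketch below uses only the standard library and treats a provider as any callable that takes a prompt and returns text; that `Provider` alias, the function name, and the in-memory dict cache are assumptions made for illustration, not a vendor SDK.

```python
import time
from typing import Callable, Dict, Sequence

Provider = Callable[[str], str]  # hypothetical: takes a prompt, returns completion text


def complete_with_fallback(prompt: str, providers: Sequence[Provider],
                           cache: Dict[str, str], max_attempts: int = 3,
                           base_delay: float = 1.0,
                           sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each provider in order, retrying with exponential backoff, and cache the result."""
    if prompt in cache:
        return cache[prompt]
    last_error = None
    for provider in providers:  # primary first, then fallbacks
        for attempt in range(max_attempts):
            try:
                result = provider(prompt)
            except Exception as exc:  # real code would catch provider-specific errors
                last_error = exc
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            else:
                cache[prompt] = result
                return result
    raise RuntimeError("all providers failed") from last_error
```

Passing `sleep` as a parameter keeps the backoff testable without real delays. A production cache would normally also key on the model and sampling parameters, not just the prompt.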
Provider Abstraction
CODEBLOCK2
Cost Management
| Provider | Input Cost | Output Cost |
|---|---|---|
| GPT-4 | $0.03/1K | $0.06/1K |
| GPT-3.5 | $0.0005/1K | $0.0015/1K |
| Claude 3 Opus | $0.015/1K | $0.075/1K |
| Claude 3 Haiku | $0.00025/1K | $0.00125/1K |
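Per-request cost follows directly from these rates. A minimal sketch, assuming the prices are USD per 1K tokens as listed in the table and that token counts come from the provider's usage metadata; `request_cost` and `within_budget` are illustrative names.

```python
# USD per 1K tokens (input, output), copied from the table above.
PRICES_PER_1K = {
    "GPT-4": (0.03, 0.06),
    "GPT-3.5": (0.0005, 0.0015),
    "Claude 3 Opus": (0.015, 0.075),
    "Claude 3 Haiku": (0.00025, 0.00125),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request, given its input and output token counts."""
    input_price, output_price = PRICES_PER_1K[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1000


def within_budget(costs, budget_usd):
    """True when the summed request costs do not exceed the budget."""
    return sum(costs) <= budget_usd
```

For example, a GPT-4 request with 1,500 input tokens and 500 output tokens comes to about $0.075: $0.045 for input plus $0.030 for output.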
RAG System Implementation
Build retrieval-augmented generation pipeline:
1. Choose vector database (Pinecone, Qdrant, Weaviate)
2. Select embedding model based on quality/cost tradeoff
3. Implement document chunking strategy
4. Create ingestion pipeline with metadata extraction
5. Build retrieval with query embedding
6. Add reranking for relevance improvement
7. Format context and send to LLM
8. Validation: Response references retrieved context, no hallucinations
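The retrieval and context-formatting steps can be illustrated without a vector database. In the sketch below, `embed` is a toy bag-of-words stand-in for a real embedding model, and an in-memory list stands in for the vector store. Both are assumptions made so the example runs on its own; all function names are illustrative.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Number the retrieved chunks so the response can cite them."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

Numbering the chunks in the prompt gives the validation step something concrete to check: a response can be required to cite at least one `[n]` marker.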
Vector Database Selection
| Database | Hosting | Scale | Latency | Best For |
|---|---|---|---|---|
| Pinecone | Managed | High | Low | Production, managed |
| Qdrant | Both | High | Very Low | Performance-critical |
| Weaviate | Both | High | Low | Hybrid search |
| Chroma | Self-hosted | Medium | Low | Prototyping |
| pgvector | Self-hosted | Medium | Medium | Existing Postgres |
Chunking Strategies
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed | 500-1000 tokens | 50-100 | General text |
| Sentence | 3-5 sentences | 1 sentence | Structured text |
| Semantic | Variable | Based on meaning | Research papers |
| Recursive | Hierarchical | Parent-child | Long documents |
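The fixed strategy in the first row amounts to a sliding window over tokens. This is a minimal sketch, assuming the input is already tokenized into a list (for instance by a tokenizer such as tiktoken, or by whitespace splitting as a rough approximation); `chunk_fixed` is an illustrative name.

```python
def chunk_fixed(tokens, chunk_size=500, overlap=50):
    """Split tokens into windows of chunk_size, each sharing `overlap` tokens with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With `chunk_size=4` and `overlap=1`, ten tokens numbered 0 to 9 yield three chunks: 0-3, 3-6, and 6-9. Each boundary token appears in both neighboring chunks, so text that straddles a boundary stays retrievable.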
Model Monitoring
Monitor production models for drift and degradation:
1. Set up latency tracking (p50, p95, p99)
2. Configure error rate alerting
3. Implement input data drift detection
4. Track prediction distribution shifts
5. Log ground truth when available
6. Compare model versions with A/B metrics
7. Set up automated retraining triggers
8. Validation: Alerts fire before user-visible degradation
Drift Detection
CODEBLOCK3
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| p95 latency | > 100ms | > 200ms |
| Error rate | > 0.1% | > 1% |
| PSI (drift) | > 0.1 | > 0.2 |
| Accuracy drop | > 2% | > 5% |
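The PSI thresholds above presuppose a way to compute the index. The sketch below is one common formulation, assuming equal-width bins over the range of the reference sample and a small floor `eps` to avoid taking the log of zero. The bin count, the floor, and the function names are assumptions, not something the table prescribes.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population stability index of `current` against `reference` for one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drift_level(score, warning=0.1, critical=0.2):
    """Classify a PSI score using the warning and critical thresholds from the table."""
    if score > critical:
        return "critical"
    if score > warning:
        return "warning"
    return "ok"
```

Identical samples give a PSI of exactly 0, and the score grows as mass moves between bins. Quantile-based bins are a frequent alternative when the feature is heavily skewed.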
Reference Documentation
MLOps Production Patterns
`references/mlops_production_patterns.md` contains:
- Model deployment pipeline with Kubernetes manifests
- Feature store architecture with Feast examples
- Model monitoring with drift detection code
- A/B testing infrastructure with traffic splitting
- Automated retraining pipeline with MLflow
LLM Integration Guide
`references/llm_integration_guide.md` contains:
- Provider abstraction layer pattern
- Retry and fallback strategies with tenacity
- Prompt engineering templates (few-shot, CoT)
- Token optimization with tiktoken
- Cost calculation and tracking
RAG System Architecture
`references/rag_system_architecture.md` contains:
- RAG pipeline implementation with code
- Vector database comparison and integration
- Chunking strategies (fixed, semantic, recursive)
- Embedding model selection guide
- Hybrid search and reranking patterns
Tools
Model Deployment Pipeline
CODEBLOCK4
Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.
RAG System Builder
CODEBLOCK5
Scaffolds RAG pipeline with vector store integration and retrieval logic.
ML Monitoring Suite
CODEBLOCK6
Sets up drift detection, alerting, and performance dashboards.
Tech Stack
| Category | Tools |
|---|---|
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| LLM Frameworks | LangChain, LlamaIndex, DSPy |
| MLOps | MLflow, Weights & Biases, Kubeflow |
| Data | Spark, Airflow, dbt, Kafka |
| Deployment | Docker, Kubernetes, Triton |
| Databases | PostgreSQL, BigQuery, Pinecone, Redis |
Senior ML Engineer
Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.
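Model Deployment Workflow
Deploy a trained model to production with monitoring:
1. Export model to standardized format (ONNX, TorchScript, SavedModel)
2. Package model with dependencies in Docker container
3. Deploy to staging environment
4. Run integration tests against staging
5. Deploy canary (5% traffic) to production
6. Monitor latency and error rates for 1 hour
7. Promote to full production if metrics pass
8. Validation: p95 latency < 100ms, error rate < 0.1%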
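Container Template
```dockerfile
FROM python:3.11-slim

# Run from /app so that `src.server:app` resolves as a module path
WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ /app/model/
COPY src/ /app/src/

# The slim image does not ship curl, so probe the health endpoint with Python
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)" || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]
```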
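Serving Options
| Option | Latency | Throughput | Use Case |
|---|---|---|---|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |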
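MLOps Pipeline Setup
Establish automated training and deployment:
1. Configure feature store (Feast, Tecton) for training data
2. Set up experiment tracking (MLflow, Weights & Biases)
3. Create training pipeline with hyperparameter logging
4. Register model in model registry with version metadata
5. Configure staging deployment triggered by registry events
6. Set up A/B testing infrastructure for model comparison
7. Enable drift monitoring with alerting
8. Validation: New models automatically evaluated against baseline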
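Feature Store Pattern
```python
from datetime import timedelta

from feast import Entity, Feature, FeatureView, FileSource, ValueType

# Written against the older Feast API (Feature / ValueType); newer releases
# declare features with Field and schema= instead.
user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),  # online values older than one day are treated as stale
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    online=True,
    source=FileSource(path="data/user_features.parquet"),
)
```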
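Retraining Triggers
| Trigger | Detection | Action |
|---|---|---|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |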
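LLM Integration Workflow
Integrate LLM APIs into production applications:
1. Create provider abstraction layer for vendor flexibility
2. Implement retry logic with exponential backoff
3. Configure fallback to secondary provider
4. Set up token counting and context truncation
5. Add response caching for repeated queries
6. Implement cost tracking per request
7. Add structured output validation with Pydantic
8. Validation: Response parses correctly, cost within budget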
Provider Abstraction
```python
from abc import ABC, abstractmethod

from tenacity import retry, stop_after_attempt, wait_exponential


class LLMProvider(ABC):
    """Common interface that each vendor-specific client implements."""

    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass


# Retry up to 3 times, backing off exponentially between 1 and 10 seconds
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
    return provider.complete(prompt)
```
Cost Management
| Provider | Input Cost | Output Cost |
|---|---|---|
| GPT-4 | $0.03/1K | $0.06/1K |
| GPT-3.5 | $0.0005/1K | $0.0015/1K |
| Claude 3 Opus | $0.015/1K | $0.075/1K |
| Claude 3 Haiku | $0.00025/1K | $0.00125/1K |
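RAG System Implementation
Build retrieval-augmented generation pipeline:
1. Choose vector database (Pinecone, Qdrant, Weaviate)
2. Select embedding model based on quality/cost tradeoff
3. Implement document chunking strategy
4. Create ingestion pipeline with metadata extraction
5. Build retrieval with query embedding
6. Add reranking for relevance improvement
7. Format context and send to LLM
8. Validation: Response references retrieved context, no hallucinations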
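Vector Database Selection
| Database | Hosting | Scale | Latency | Best For |
|---|---|---|---|---|
| Pinecone | Managed | High | Low | Production, managed |
| Qdrant | Both | High | Very Low | Performance-critical |
| Weaviate | Both | High | Low | Hybrid search |
| Chroma | Self-hosted | Medium | Low | Prototyping |
| pgvector | Self-hosted | Medium | Medium | Existing Postgres |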
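Chunking Strategies
| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed | 500-1000 tokens | 50-100 | General text |
| Sentence | 3-5 sentences | 1 sentence | Structured text |
| Semantic | Variable | Based on meaning | Research papers |
| Recursive | Hierarchical | Parent-child | Long documents |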
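Model Monitoring
Monitor production models for drift and degradation:
1. Set up latency tracking (p50, p95, p99)
2. Configure error rate alerting
3. Implement input data drift detection
4. Track prediction distribution shifts
5. Log ground truth when available
6. Compare model versions with A/B metrics
7. Set up automated retraining triggers
8. Validation: Alerts fire before user-visible degradation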
Drift Detection
```python
from scipy.stats import ks_2samp


def detect_drift(reference, current, threshold=0.05):
    """Two-sample Kolmogorov-Smirnov test; drift is flagged when p < threshold."""
    statistic, p_value = ks_2samp(reference, current)
    return {
        "drift_detected": p_value < threshold,
        "ks_statistic": statistic,
        "p_value": p_value,
    }
```
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| p95 latency | > 100ms | > 200ms |
| Error rate | > 0.1% | > 1% |
| PSI (drift) | > 0.1 | > 0.2 |
| Accuracy drop | > 2% | > 5% |
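Reference Documentation
MLOps Production Patterns
`references/mlops_production_patterns.md` contains:
- Model deployment pipeline with Kubernetes manifests
- Feature store architecture with Feast examples
- Model monitoring with drift detection code
- A/B testing infrastructure with traffic splitting
- Automated retraining pipeline with MLflow
LLM Integration Guide
`references/llm_integration_guide.md` contains:
- Provider abstraction layer pattern
- Retry and fallback strategies with tenacity
- Prompt engineering templates (few-shot, CoT)
- Token optimization with tiktoken
- Cost calculation and tracking
RAG System Architecture
`references/rag_system_architecture.md` contains: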