DeepSpeed Fine-tuning Skill
This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.
Prerequisites
- Python 3.8+
- GPU(s) or accelerator(s) with a DeepSpeed-supported backend (CUDA, ROCm, Intel XPU, etc.)
- DeepSpeed: pip install deepspeed
- Transformers, Datasets, PEFT (for LoRA support)
- sshpass: sudo apt-get install sshpass (for remote training)
Plan Selection Workflow
Never auto-select a plan. List viable options based on user hardware and requirements, and let the user decide.
Step 1: Gather Information
Confirm the following with the user:
- Target model: Model name and parameter count (e.g., Qwen2.5-7B)
- Hardware environment:
  - GPU VRAM x count (e.g., "single 24GB GPU")
  - CPU core count
  - RAM size
  - Free disk space
  - NVMe SSD availability (affects ZeRO NVMe offload)
- Training goal: Full fine-tuning or parameter-efficient? Dataset size? Expected quality?
- Budget/time constraints: Acceptable training duration?
If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (nvidia-smi, free -h, df -h, nproc).
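The auto-detection step can be sketched with the Python standard library. This is an illustrative helper, not one of the skill's scripts: the name detect_hardware and the returned keys are assumptions. It reads RAM from /proc/meminfo (Linux only) and queries GPUs through nvidia-smi when that tool is on PATH, returning an empty GPU list otherwise.

```python
import os
import shutil
import subprocess


def detect_hardware(disk_path="."):
    """Collect CPU, RAM, disk, and GPU facts (hypothetical helper)."""
    info = {
        "cpu_cores": os.cpu_count(),
        "ram_total_gb": None,
        "disk_free_gb": round(shutil.disk_usage(disk_path).free / 1e9, 1),
        "gpus": [],
    }

    # Total RAM: /proc/meminfo reports MemTotal in kB on Linux.
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    kib = int(line.split()[1])
                    info["ram_total_gb"] = round(kib * 1024 / 1e9, 1)
                    break
    except (OSError, ValueError, IndexError):
        pass

    # GPU name and total VRAM (MiB) via nvidia-smi, if it is installed.
    if shutil.which("nvidia-smi"):
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,memory.total",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True, timeout=30,
            ).stdout
        except (subprocess.SubprocessError, OSError):
            out = ""
        for row in out.strip().splitlines():
            try:
                name, mem_mib = [part.strip() for part in row.rsplit(",", 1)]
                vram_gb = round(int(mem_mib) * 1024 * 1024 / 1e9, 1)
            except ValueError:
                continue  # skip rows that do not parse
            info["gpus"].append({"name": name, "vram_gb": vram_gb})
    return info


if __name__ == "__main__":
    print(detect_hardware())
```

On a remote machine the same script would be run over the SSH session, and the resulting numbers feed the feasibility estimate in Step 2.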
Step 2: Evaluate Feasibility
Estimate VRAM requirements based on model size (bf16):
| Params | Model Weights (bf16) | + Adam Optimizer + Gradients |
|---|---|---|
| 0.5B | ~1 GB | ~5 GB |
| 1.5B | ~3 GB | ~15 GB |
| 3B | ~6 GB | ~30 GB |
| 7B | ~14 GB | ~70 GB |
| 14B | ~28 GB | ~140 GB |
| 32B | ~64 GB | ~320 GB |
| 72B | ~144 GB | ~720 GB |
Breakdown: the Adam optimizer stores 2 fp32 state tensors (momentum + variance) = 8 bytes/param. Gradients in bf16 = 2 bytes/param. Together that is approx. 10 bytes/param (5x the bf16 weight size), required on top of the weights themselves.
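The per-parameter accounting above can be turned into a small estimator. This is a minimal sketch under the document's stated byte counts; the function name and the returned keys are assumptions, and 1 GB is taken as 1e9 bytes so that billions of parameters map directly to GB.

```python
BYTES_WEIGHTS_BF16 = 2   # bf16 weights
BYTES_ADAM_STATES = 8    # two fp32 tensors: momentum + variance
BYTES_GRADS_BF16 = 2     # bf16 gradients


def estimate_static_memory_gb(params_billions):
    """Return weight and optimizer+gradient memory in GB (1 GB = 1e9 bytes)."""
    weights = params_billions * BYTES_WEIGHTS_BF16
    extra = params_billions * (BYTES_ADAM_STATES + BYTES_GRADS_BF16)
    return {
        "weights_gb": weights,
        "optimizer_and_grads_gb": extra,
        "total_gb": weights + extra,
    }


if __name__ == "__main__":
    for size in (0.5, 7, 72):
        print(f"{size}B ->", estimate_static_memory_gb(size))
```

These figures exclude activation memory, which depends on sequence length and batch size rather than parameter count.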
Activation memory: Depends on sequence length and batch size, not model params alone.
- Formula (per layer): activation ≈ 34 × sequence length × hidden size × batch size × bytes per element
- Example: 7B model (hidden=4096), seq_len=2048, batch_size=4, bf16 -> 34 × 2048 × 4096 × 4 × 2 bytes ≈ 2.3 GB per layer; ≈ 73 GB total assuming 32 layers (can dominate VRAM)
- Gradient checkpointing reduces this by ~80% (recomputes instead of storing), but adds ~20% compute overhead
- Rule of thumb: if seq_len × batch_size > 8192, activation memory likely exceeds model weights
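The activation formula can be evaluated directly. The sketch below applies it exactly as written (coefficient 34, per layer). The function name, the num_layers argument, and the 32-layer count are assumptions for illustration, and the 80% reduction is the document's rough figure for gradient checkpointing, not a measured value.

```python
def activation_memory_gb(seq_len, hidden_size, batch_size,
                         bytes_per_element=2, num_layers=1):
    """Activation memory in GB using 34 * seq * hidden * batch * bytes per layer."""
    per_layer_bytes = 34 * seq_len * hidden_size * batch_size * bytes_per_element
    return per_layer_bytes * num_layers / 1e9


if __name__ == "__main__":
    per_layer = activation_memory_gb(2048, 4096, 4)
    total = activation_memory_gb(2048, 4096, 4, num_layers=32)  # assumed 32 layers
    with_checkpointing = total * (1 - 0.80)  # document's ~80% reduction
    print(f"per layer: {per_layer:.2f} GB")
    print(f"all layers: {total:.1f} GB")
    print(f"with gradient checkpointing: {with_checkpointing:.1f} GB")
```

Halving either the sequence length or the batch size halves the estimate, which is why those two settings are the first things to adjust when activations dominate.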
LoRA/QLoRA: VRAM depends on rank, target modules, and layer dimensions — not directly proportional to total model params. See references/lora_guide.md for LoRA-specific memory estimation.
Step 2.5: Activation Checkpointing
If VRAM is tight, activation checkpointing is the most impactful knob — it can reduce activation memory by ~80%.
How it works: Instead of storing every intermediate activation for backprop, training saves checkpoints only at selected layers and recomputes the remaining activations during the backward pass. This trades compute for memory.
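The store-versus-recompute trade can be illustrated with a toy scalar network in plain Python. This is a teaching sketch only: the tanh "layers", the function names, and the fixed every-k checkpoint spacing are assumptions made here. Real frameworks do the same thing with tensors, for example through torch.utils.checkpoint.

```python
import math


def layer(x, w):
    """One toy layer: a scalar tanh unit."""
    return math.tanh(w * x)


def layer_grad(x, w):
    """Derivative of layer(x, w) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(x0, weights):
    """Backprop that keeps the input of every layer in memory."""
    inputs = [x0]
    for w in weights[:-1]:
        inputs.append(layer(inputs[-1], w))
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad *= layer_grad(inputs[i], weights[i])
    return grad, len(inputs)


def grad_checkpointed(x0, weights, every):
    """Backprop that stores one input per `every` layers and recomputes the rest."""
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x  # keep only the input at each segment start
        x = layer(x, w)

    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, n)
        segment = [checkpoints[start]]  # recompute inputs inside this segment
        for i in range(start, end - 1):
            segment.append(layer(segment[-1], weights[i]))
        # segment[0] is the checkpoint itself, so it is not counted twice
        peak = max(peak, len(checkpoints) + len(segment) - 1)
        for i in reversed(range(start, end)):
            grad *= layer_grad(segment[i - start], weights[i])
    return grad, peak


if __name__ == "__main__":
    ws = [0.5 + 0.1 * i for i in range(12)]
    print(grad_store_all(0.3, ws))
    print(grad_checkpointed(0.3, ws, every=4))
```

Both functions return the same gradient. The checkpointed version holds fewer values at its peak and pays for that by re-running the forward computation inside each segment.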
Two ways to enable:
1. HF Trainer flag (simplest, works out of the box):
    python scripts/ds_train.py --gradient_checkpointing ...
2. DeepSpeed config (fine-grained control):
    {
      "activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": true,
        "contiguous_memory_optimization": true,
        "number_checkpoints": 4
      }
    }
| Option | Effect | When to use |
|---|---|---|
| partition_activations | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
| cpu_checkpointing | Store checkpoints in CPU RAM instead of GPU | GPU memory very tight |
| contiguous_memory_optimization | Reduce memory fragmentation | Large models, many checkpoints |
| number_checkpoints | Control checkpoint frequency (fewer = less VRAM, more compute) | Tune based on VRAM budget |
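The options in the table can be merged into an existing DeepSpeed config programmatically. This is a minimal sketch: the helper name with_activation_checkpointing is an assumption. The check that cpu_checkpointing needs partition_activations reflects how the DeepSpeed configuration docs describe CPU offload of partitioned activations, and should be verified against the DeepSpeed version in use.

```python
import json


def with_activation_checkpointing(ds_config, partition=False, cpu=False,
                                  contiguous=False, num_checkpoints=None):
    """Return a copy of ds_config with an activation_checkpointing section."""
    if cpu and not partition:
        # Assumption from the DeepSpeed docs: CPU offload applies to
        # partitioned activations, so reject it without partitioning.
        raise ValueError("cpu_checkpointing expects partition_activations=True")
    section = {
        "partition_activations": partition,
        "cpu_checkpointing": cpu,
        "contiguous_memory_optimization": contiguous,
    }
    if num_checkpoints is not None:
        section["number_checkpoints"] = num_checkpoints
    merged = dict(ds_config)  # shallow copy; the input dict is left untouched
    merged["activation_checkpointing"] = section
    return merged


if __name__ == "__main__":
    config = with_activation_checkpointing(
        {"zero_optimization": {"stage": 2}},
        partition=True, cpu=True, contiguous=True, num_checkpoints=4,
    )
    print(json.dumps(config, indent=2))
```

The Hugging Face DeepSpeed integration notes indicate that Transformers models rely on PyTorch's own checkpointing, so this config section may only take effect for models written against DeepSpeed's activation checkpointing API. Check the current documentation before relying on it.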
Step 3: List Options
Based on the VRAM assessment, list all viable approaches. Example:
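Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
here are the training options for Qwen2.5-7B:

Option A: LoRA fine-tuning (recommended)
- VRAM required: ~22 GB
- Speed: fast
- Quality: suited to instruction alignment and style adaptation
- Trainable parameters: ~20M (0.4% of total)

Option B: QLoRA fine-tuning (saves VRAM)
- VRAM required: ~12 GB
- Speed: medium (quantize/dequantize overhead)
- Quality: slightly below LoRA, but the gap is small

Option C: Full fine-tuning (not feasible)
- VRAM required: ~56 GB (exceeds 24GB)
- Requires ZeRO-2 + CPU offload, or a larger GPU

Which option do you prefer?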
Step 4: Hardware Insufficient? Make Recommendations
If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):
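You want to fully fine-tune a 7B model, but the current hardware (single 24GB GPU) is insufficient.

Recommended hardware specs:

Minimum:
- GPU: 1x 80GB VRAM
- CPU: 16+ cores
- RAM: 128 GB+
- Disk: 200 GB+ free

Recommended:
- GPU: 2x 80GB VRAM (ZeRO-2 can double training speed)
- CPU: 32+ cores
- RAM: 256 GB+
- Disk: 500 GB+ free

Alternatively, use LoRA: 24GB of VRAM is enough for a 7B model.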
Key Principles
- Never auto-select and start training: always list options and wait for user confirmation
- Recommend but don't decide: say "I recommend Option A because..." but let the user choose
- Use generic hardware metrics: VRAM in GB, GPU count, CPU cores, RAM in GB, disk in GB. No brand names.
- Leave VRAM headroom: recommend at least 20% buffer to avoid OOM
- If the user picks an infeasible option, warn them clearly rather than silently switching
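The principles above (list every option, keep 20% headroom, warn instead of switching) can be sketched as a small classifier. The Plan class, the status wording, and the VRAM figures in the demo are illustrative assumptions. The function labels every plan and leaves the choice to the user.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    vram_needed_gb: float


def classify_plans(plans, gpu_vram_gb, headroom=0.20):
    """Label every plan; never drop or pick one on the user's behalf."""
    usable = gpu_vram_gb * (1 - headroom)
    report = []
    for plan in plans:
        if plan.vram_needed_gb <= usable:
            status = "fits with headroom"
        elif plan.vram_needed_gb <= gpu_vram_gb:
            status = "fits, but below the recommended headroom"
        else:
            status = "not feasible on this GPU"
        report.append((plan.name, plan.vram_needed_gb, status))
    return report


if __name__ == "__main__":
    options = [Plan("LoRA", 22), Plan("QLoRA", 12), Plan("Full fine-tuning", 56)]
    for name, need, status in classify_plans(options, gpu_vram_gb=24):
        print(f"{name}: ~{need} GB -> {status}")
```

Keeping the "below headroom" case separate from "not feasible" lets the agent present a tight option with a clear warning instead of hiding it.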
Core Capabilities
1. Training Configuration
Generate DeepSpeed ZeRO configurations:
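    from scripts.generate_ds_config import generate_zero_config

    # ZeRO stage 2 with optimizer offload
    config = generate_zero_config(
        zero_stage=2,
        offload_optimizer=True,
        offload_device="nvme",
        nvme_path="/local_nvme"
    )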
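For readers without the skill's scripts at hand, the shape of a ZeRO config with optimizer offload can be sketched by hand. The helper build_zero_config is hypothetical and is not the skill's generator. The keys used (zero_optimization, stage, offload_optimizer, device, pin_memory, nvme_path) come from the DeepSpeed config schema, and the "auto" values assume the Hugging Face Trainer integration fills them in. The DeepSpeed configuration docs describe NVMe offload as a ZeRO stage 3 feature, so this demo pairs stage 2 with CPU offload.

```python
import json


def build_zero_config(stage, offload_optimizer_device=None, nvme_path=None):
    """Build a minimal DeepSpeed config dict (hypothetical helper)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    zero = {"stage": stage}
    if offload_optimizer_device is not None:
        offload = {"device": offload_optimizer_device, "pin_memory": True}
        if offload_optimizer_device == "nvme":
            if not nvme_path:
                raise ValueError("nvme offload needs nvme_path")
            offload["nvme_path"] = nvme_path
        zero["offload_optimizer"] = offload
    return {
        "bf16": {"enabled": True},
        "zero_optimization": zero,
        # "auto" lets the HF Trainer integration supply these values.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


if __name__ == "__main__":
    print(json.dumps(build_zero_config(2, offload_optimizer_device="cpu"), indent=2))
```

The resulting dict can be written to a JSON file and passed through the --deepspeed argument in the same way as the bundled config files.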
2. Training Launch
Use the training launcher script:
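    python scripts/ds_train.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --dataset_path data/my_dataset \
        --output_dir ./outputs \
        --deepspeed assets/ds_config_zero2.json \
        --num_train_epochs 3 \
        --per_device_train_batch_size 4 \
        --learning_rate 2e-5 \
        --lora_r 16 \
        --lora_alpha 32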
3. LoRA/QLoRA Integration
For parameter-efficient fine-tuning:
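    # LoRA config is generated automatically from the arguments
    peft_config = {
        "peft_type": "LORA",
        "r": 16,
        "lora_alpha": 32,
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
        "lora_dropout": 0.05,
        "bias": "none",
        "task_type": "CAUSAL_LM"
    }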
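How rank and target modules drive LoRA size can be checked with a short calculation: each adapted weight of shape (d_in, d_out) adds r × (d_in + d_out) trainable parameters. The sketch below assumes a model with hidden size 4096, 32 layers, and square attention projections. Those shapes are illustrative, and models that use grouped-query attention have smaller k_proj and v_proj matrices.

```python
def lora_trainable_params(rank, module_shapes, num_layers):
    """Count LoRA parameters: r * (d_in + d_out) per adapted module per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


if __name__ == "__main__":
    # Assumed shapes: hidden size 4096, four square attention projections.
    shapes = {
        "q_proj": (4096, 4096),
        "k_proj": (4096, 4096),
        "v_proj": (4096, 4096),
        "o_proj": (4096, 4096),
    }
    total = lora_trainable_params(rank=16, module_shapes=shapes, num_layers=32)
    print(f"trainable LoRA parameters: {total:,}")
    print(f"adapter weights in bf16: {total * 2 / 1e6:.1f} MB")
```

Doubling the rank or adding more target modules scales the count linearly, which is why reducing the rank is a quick lever when memory is tight.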
4. Multi-GPU Training
Use the deepspeed launcher for multi-GPU training (recommended over torchrun):
CODEBLOCK7
5. Training Monitoring
Monitor training progress:
CODEBLOCK8
6. Early Stopping
Automatically monitors eval loss and stops training early when there's no improvement across consecutive evaluations, then loads the best checkpoint.
Parameters:
- --early_stopping_patience: How many consecutive evals without improvement to tolerate. Set to 0 to disable (default). Recommended: 3-10.
- INLINECODE14: Minimum eval loss improvement to count as an improvement. Default 0.0 (any decrease counts).
Example:
CODEBLOCK9
Auto-configuration: When early_stopping_patience > 0, the script automatically:
1. Enables INLINECODE16
2. Sets metric_for_best_model=eval_loss, INLINECODE18
3. Aligns save_strategy with eval_strategy (synced saving is needed to restore the best checkpoint)
Notes:
- Must also set eval_strategy (e.g., steps + eval_steps), otherwise early stopping won't work
- Don't set patience too low (<3): early training fluctuations may cause premature stopping
- For LoRA fine-tuning, patience=5 with eval_steps=100 typically works well
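The patience and threshold rules described above can be sketched in a few lines. This illustrates the stopping logic only and is not the training script's implementation; the function name and its return convention are assumptions.

```python
def early_stop_index(eval_losses, patience, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    if patience <= 0:
        return None  # early stopping disabled
    best = None
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        # An improvement must beat the best loss by more than the threshold.
        if best is None or best - loss > threshold:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None


if __name__ == "__main__":
    losses = [1.0, 0.9, 0.91, 0.92, 0.93, 0.5]
    print(early_stop_index(losses, patience=3))
    print(early_stop_index(losses, patience=5))
```

With a patience of 3 the run stops before it reaches the later drop to 0.5, while a patience of 5 lets it continue. This is the trade-off behind the advice not to set patience too low.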
Remote Training
When training needs to run on a remote GPU server, see references/remote_training.md for the complete guide including agent guidelines, security model, and command reference.
Troubleshooting
OOM Errors
- Reduce the per-device batch size, raising gradient accumulation steps to keep the effective batch size
- Enable gradient checkpointing: --gradient_checkpointing
- Use ZeRO-3 with CPU/NVMe offloading
- Reduce LoRA rank: INLINECODE28
- See references/troubleshooting.md for detailed solutions
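The first OOM remedy relies on the identity effective batch = per-device batch × gradient accumulation steps × GPU count. A minimal sketch of halving the per-device batch while holding the effective batch constant (the helper name is an assumption):

```python
def halve_batch_keep_effective(per_device_batch, grad_accum_steps, num_gpus=1):
    """Halve the per-device batch and double accumulation; effective batch is unchanged."""
    if per_device_batch < 2 or per_device_batch % 2:
        raise ValueError("per-device batch must be an even number >= 2")
    effective = per_device_batch * grad_accum_steps * num_gpus
    new_batch = per_device_batch // 2
    new_accum = grad_accum_steps * 2
    return new_batch, new_accum, effective


if __name__ == "__main__":
    batch, accum, effective = halve_batch_keep_effective(4, 2, num_gpus=1)
    print(f"per-device batch {batch}, accumulation {accum}, effective batch {effective}")
```

Activation memory scales with the per-device batch, so this lowers peak VRAM at the cost of more forward/backward passes per optimizer step.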
Slow Training
- Ensure bf16/fp16 is enabled
- Check GPU utilization with INLINECODE29
- Use FlashAttention if available
- Optimize data loading with INLINECODE30
- See references/troubleshooting.md for detailed solutions
Checkpoint Issues
- Use --save_strategy steps with INLINECODE32
- Enable --save_total_limit to cap checkpoint count
- For ZeRO-3, use --zero3_save_16bit_model to save FP16 weights
- See references/troubleshooting.md for detailed solutions
MPI Errors (multi-GPU only)
- Single-GPU training does not need MPI
- If you see MPI errors on a single GPU, use python3 directly instead of the deepspeed launcher
- See references/troubleshooting.md for the full MPI debugging guide
Single-GPU Strategy
References
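DeepSpeed Fine-tuning Skill
This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.
Prerequisites
- Python 3.8+
- GPU(s) or accelerator(s) with a DeepSpeed-supported backend (CUDA, ROCm, Intel XPU, etc.)
- DeepSpeed: pip install deepspeed
- Transformers, Datasets, PEFT (for LoRA support)
- sshpass: sudo apt-get install sshpass (for remote training)
Plan Selection Workflow
Never auto-select a plan. List viable options based on user hardware and requirements, and let the user decide.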
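Step 1: Gather Information
Confirm the following with the user:
- Target model: Model name and parameter count (e.g., Qwen2.5-7B)
- Hardware environment:
  - GPU VRAM × count (e.g., single 24GB GPU)
  - CPU core count
  - RAM size
  - Free disk space
  - NVMe SSD availability (affects ZeRO NVMe offload)
- Training goal: Full fine-tuning or parameter-efficient? Dataset size? Expected quality?
- Budget/time constraints: Acceptable training duration?
If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (nvidia-smi, free -h, df -h, nproc).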
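Step 2: Evaluate Feasibility
Estimate VRAM requirements based on model size (bf16):
| Params | Model Weights (bf16) | + Adam Optimizer + Gradients |
|---|---|---|
| 0.5B | ~1 GB | ~5 GB |
| 1.5B | ~3 GB | ~15 GB |
| 3B | ~6 GB | ~30 GB |
| 7B | ~14 GB | ~70 GB |
| 14B | ~28 GB | ~140 GB |
| 32B | ~64 GB | ~320 GB |
| 72B | ~144 GB | ~720 GB |
Breakdown: the Adam optimizer stores 2 fp32 state tensors (momentum + variance) = 8 bytes/param. Gradients in bf16 = 2 bytes/param. Together that is approx. 10 bytes/param (5x the bf16 weight size), required on top of the weights themselves.
Activation memory: Depends on sequence length and batch size, not model params alone.
- Formula (per layer): activation ≈ 34 × sequence length × hidden size × batch size × bytes per element
- Example: 7B model (hidden=4096), sequence length=2048, batch size=4, bf16 → 34 × 2048 × 4096 × 4 × 2 bytes ≈ 2.3 GB per layer; ≈ 73 GB total assuming 32 layers (can dominate VRAM)
- Gradient checkpointing reduces this by ~80% (recomputes instead of storing), but adds ~20% compute overhead
- Rule of thumb: if sequence length × batch size > 8192, activation memory likely exceeds model weights
LoRA/QLoRA: VRAM depends on rank, target modules, and layer dimensions, and is not directly proportional to total model params. See references/lora_guide.md for LoRA-specific memory estimation.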
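Step 2.5: Activation Checkpointing
If VRAM is tight, activation checkpointing is the most impactful knob: it can reduce activation memory by ~80%.
How it works: Instead of storing every intermediate activation for backprop, training saves checkpoints only at selected layers and recomputes the remaining activations during the backward pass. This trades compute for memory.
Two ways to enable:
1. HF Trainer flag (simplest, works out of the box):
    python scripts/ds_train.py --gradient_checkpointing ...
2. DeepSpeed config (fine-grained control):
    {
      "activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": true,
        "contiguous_memory_optimization": true,
        "number_checkpoints": 4
      }
    }
| Option | Effect | When to use |
|---|---|---|
| partition_activations | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
| cpu_checkpointing | Store checkpoints in CPU RAM instead of GPU | GPU memory very tight |
| contiguous_memory_optimization | Reduce memory fragmentation | Large models, many checkpoints |
| number_checkpoints | Control checkpoint frequency (fewer = less VRAM, more compute) | Tune based on VRAM budget |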
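Step 3: List Options
Based on the VRAM assessment, list all viable approaches. Example:
Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
here are the training options for Qwen2.5-7B:

Option A: LoRA fine-tuning (recommended)
- VRAM required: ~22 GB
- Speed: fast
- Quality: suited to instruction alignment and style adaptation
- Trainable parameters: ~20M (0.4% of total)

Option B: QLoRA fine-tuning (saves VRAM)
- VRAM required: ~12 GB
- Speed: medium (quantize/dequantize overhead)
- Quality: slightly below LoRA, but the gap is small

Option C: Full fine-tuning (not feasible)
- VRAM required: ~56 GB (exceeds 24GB)
- Requires ZeRO-2 + CPU offload, or a larger GPU

Which option do you prefer?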
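Step 4: Hardware Insufficient? Make Recommendations
If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):
You want to fully fine-tune a 7B model, but the current hardware (single 24GB GPU) is insufficient.

Recommended hardware specs:

Minimum:
- GPU: 1x 80GB VRAM
- CPU: 16+ cores
- RAM: 128 GB+
- Disk: 200 GB+ free

Recommended:
- GPU: 2x 80GB VRAM (ZeRO-2 can double training speed)
- CPU: 32+ cores
- RAM: 256 GB+
- Disk: 500 GB+ free

Alternatively, use LoRA: 24GB of VRAM is enough for a 7B model.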
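Key Principles
- Never auto-select and start training: always list options and wait for user confirmation
- Recommend but don't decide: say "I recommend Option A because..." but let the user choose
- Use generic hardware metrics: VRAM in GB, GPU count, CPU cores, RAM in GB, disk in GB. No brand names.
- Leave VRAM headroom: recommend at least 20% buffer to avoid OOM
- If the user picks an infeasible option, warn them clearly rather than silently switching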
Core Capabilities
1. Training Configuration
Generate DeepSpeed ZeRO configurations:
    from scripts.generate_ds_config import generate_zero_config

    # ZeRO stage 2 with optimizer offload
    config = generate_zero_config(
        zero_stage=2,
        offload_optimizer=True,
        offload_device="nvme",
        nvme_path="/local_nvme"
    )
2. Training Launch
Use the training launcher script:
    python scripts/ds_train.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --dataset_path data/my_dataset \
        --output_dir ./outputs \
        --deepspeed assets/ds_config_zero2.json \
        --num_train_epochs 3 \
        --per_device_train_batch_size 4 \
        --learning_rate 2e-5 \
        --lora_r 16 \
        --lora_alpha 32
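3. LoRA/QLoRA Integration
For parameter-efficient fine-tuning:
    # LoRA config is generated automatically from the arguments
    peft_config = {
        "peft_type": "LORA",
        "r": 16,
        "lora_alpha": 32,
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
        "lora_dropout": 0.05,
        "bias": "none",
        "task_type": "CAUSAL_LM"
    }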
4. Multi-GPU Training
Use the deepspeed launcher for multi-GPU training (recommended over torchrun):
    # Single node, multiple GPUs
    deepspeed --num_gpus=4 scripts/ds_train.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --deepspeed assets/ds_config_zero3.json \
        ...
    # Multi-node
    deepspeed --hostfile hosts.txt scripts/ds_train.py \