DeepSpeed Fine-tuning Skill

This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.

Prerequisites

- Python 3.8+
- GPU(s) or accelerator(s) with a DeepSpeed-supported backend (CUDA, ROCm, Intel XPU, etc.)
- DeepSpeed: `pip install deepspeed`
- Transformers, Datasets, PEFT (for LoRA support)
- sshpass: `sudo apt-get install sshpass` (for remote training)

Plan Selection Workflow

Never auto-select a plan. List viable options based on user hardware and requirements, and let the user decide.

Step 1: Gather Information

Confirm the following with the user:

- Target model: model name and parameter count (e.g., Qwen2.5-7B)
- Hardware environment:
  - GPU VRAM x count (e.g., "single 24GB GPU")
  - CPU core count
  - RAM size
  - Free disk space
  - NVMe SSD availability (affects ZeRO NVMe offload)
- Training goal: full fine-tuning or parameter-efficient? Dataset size? Expected quality?
- Budget/time constraints: acceptable training duration?

If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (nvidia-smi, free -h, df -h, nproc).
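
The auto-detection step can be sketched as a small script run on the target machine (for example over SSH). This is a minimal sketch, not part of the skill's scripts: the function names `parse_gpu_csv` and `run_probes` are hypothetical, and the GPU probe assumes an NVIDIA system where `nvidia-smi` reports `memory.total` in MiB.

```python
import shutil
import subprocess
from typing import Dict, List

# Read-only probes built from the commands named above.
PROBES: Dict[str, List[str]] = {
    "gpu": ["nvidia-smi", "--query-gpu=name,memory.total",
            "--format=csv,noheader,nounits"],
    "ram": ["free", "-h"],
    "disk": ["df", "-h"],
    "cpu": ["nproc"],
}


def parse_gpu_csv(text: str) -> List[dict]:
    """Parse 'name, memory_mib' lines into dicts with VRAM in GB (MiB / 1024)."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem_mib = [part.strip() for part in line.rsplit(",", 1)]
        gpus.append({"name": name, "vram_gb": round(int(mem_mib) / 1024, 1)})
    return gpus


def run_probes() -> Dict[str, str]:
    """Run every probe whose binary exists and collect its raw output."""
    results = {}
    for key, cmd in PROBES.items():
        if shutil.which(cmd[0]) is None:
            results[key] = "unavailable"
            continue
        done = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results[key] = done.stdout.strip() or done.stderr.strip()
    return results


if __name__ == "__main__":
    for key, output in run_probes().items():
        print(f"== {key} ==\n{output}\n")
```

The raw output is kept as text so it can be shown to the user for confirmation; only the GPU line is parsed, because VRAM per device drives the feasibility estimate in Step 2.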

Step 2: Evaluate Feasibility

Estimate VRAM requirements based on model size (bf16):

| Params | Model Weights (bf16) | + Adam Optimizer + Gradients |
|---|---|---|
| 0.5B | ~1 GB | ~5 GB |
| 1.5B | ~3 GB | ~15 GB |
| 3B | ~6 GB | ~30 GB |
| 7B | ~14 GB | ~70 GB |
| 14B | ~28 GB | ~140 GB |
| 32B | ~64 GB | ~320 GB |
| 72B | ~144 GB | ~720 GB |

Breakdown: Adam optimizer stores 2 fp32 state tensors (momentum + variance) = 8 bytes/param. Gradients = 2 bytes/param (bf16). Total approx. 10 bytes/param (5x model weight size).
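
The breakdown above is simple arithmetic, so it can be reproduced for any model size. The helper below is a sketch using only the byte counts stated in this section (2 bytes/param for bf16 weights, 8 for the two fp32 Adam states, 2 for bf16 gradients); the function name is hypothetical, and activations are deliberately excluded.

```python
BYTES_WEIGHTS_BF16 = 2  # bf16 model weights
BYTES_ADAM_FP32 = 8     # two fp32 Adam states (momentum + variance)
BYTES_GRADS_BF16 = 2    # bf16 gradients


def estimate_training_memory_gb(params_billion: float) -> dict:
    """Estimate weight and optimizer/gradient memory in GB (activations excluded)."""
    params = params_billion * 1e9
    weights = params * BYTES_WEIGHTS_BF16 / 1e9
    extra = params * (BYTES_ADAM_FP32 + BYTES_GRADS_BF16) / 1e9
    return {
        "weights_gb": weights,
        "optimizer_and_grads_gb": extra,
        "total_gb": weights + extra,
    }


if __name__ == "__main__":
    for size in (0.5, 1.5, 3, 7, 14, 32, 72):
        est = estimate_training_memory_gb(size)
        print(f"{size}B: weights ~{est['weights_gb']:.0f} GB, "
              f"optimizer + gradients ~{est['optimizer_and_grads_gb']:.0f} GB")
```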

Activation memory: Depends on sequence length and batch size, not model params alone.

- Formula (per layer): `activation_bytes ≈ 34 × seq_len × hidden_size × batch_size × bytes_per_element`
- Example: 7B model (hidden=4096), seq_len=2048, batch_size=4, bf16 -> 34 × 2048 × 4096 × 4 × 2 bytes ≈ 2.3 GB per layer; ~73 GB total assuming 32 layers (can dominate VRAM)
- Gradient checkpointing reduces this by ~80% (recomputes instead of storing), but adds ~20% compute overhead
- Rule of thumb: if seq_len x batch_size > 8192, activation memory likely exceeds model weights
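
A worked version of the activation estimate, as a sketch. It applies the per-layer formula `34 × seq_len × hidden_size × batch_size × bytes_per_element`; the layer count of 32 and the flat 80% checkpointing saving are assumptions taken as inputs, not measured values.

```python
def activation_gb_per_layer(seq_len: int, hidden_size: int, batch_size: int,
                            bytes_per_element: int = 2) -> float:
    """Per-layer activation memory in GB from the rule-of-thumb formula."""
    return 34 * seq_len * hidden_size * batch_size * bytes_per_element / 1e9


def activation_gb_total(per_layer_gb: float, num_layers: int,
                        checkpointing_saving: float = 0.0) -> float:
    """Total activation memory, optionally reduced by gradient checkpointing."""
    return per_layer_gb * num_layers * (1.0 - checkpointing_saving)


if __name__ == "__main__":
    per_layer = activation_gb_per_layer(seq_len=2048, hidden_size=4096, batch_size=4)
    total = activation_gb_total(per_layer, num_layers=32)  # assumed layer count
    with_ckpt = activation_gb_total(per_layer, num_layers=32, checkpointing_saving=0.8)
    print(f"per layer: {per_layer:.2f} GB, total: {total:.1f} GB, "
          f"with checkpointing: {with_ckpt:.1f} GB")
```

Because the estimate scales linearly in both `seq_len` and `batch_size`, halving either one halves activation memory, which is usually the first thing to try before changing the training strategy.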

LoRA/QLoRA: VRAM depends on rank, target modules, and layer dimensions — not directly proportional to total model params. See references/lora_guide.md for LoRA-specific memory estimation.
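
To see why LoRA cost follows rank and layer shapes rather than total parameter count: each adapted linear layer of shape (d_in, d_out) gains two low-rank matrices with `r × (d_in + d_out)` trainable parameters in total. The sketch below counts them; the shapes in the usage example (four 4096 × 4096 projections, 32 blocks) are illustrative assumptions, not values read from a specific checkpoint.

```python
from typing import Iterable, Tuple


def lora_trainable_params(rank: int,
                          module_shapes: Iterable[Tuple[int, int]],
                          num_layers: int) -> int:
    """Count LoRA parameters: r * (d_in + d_out) per adapted linear layer."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return per_block * num_layers


if __name__ == "__main__":
    # Assumed shapes: q, k, v, o projections of size 4096 x 4096 in each block.
    shapes = [(4096, 4096)] * 4
    for r in (8, 16, 32):
        count = lora_trainable_params(r, shapes, num_layers=32)
        print(f"rank {r}: {count / 1e6:.1f}M trainable parameters")
```

The count doubles with the rank and grows with every extra target module, while the frozen base weights stay the same size.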

Step 2.5: Activation Checkpointing

If VRAM is tight, activation checkpointing is the most impactful knob — it can reduce activation memory by ~80%.

How it works: Instead of storing all intermediate activations for backprop, only save checkpoints at select layers. Remaining activations are recomputed during backward pass. Trades compute for memory.

Two ways to enable:

1. HF Trainer flag (simplest, works out of the box):

```bash
python scripts/ds_train.py --gradient_checkpointing ...
```

2. DeepSpeed config (fine-grained control):

```json
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  }
}
```

| Option | Effect | When to use |
|---|---|---|
| `partition_activations` | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
| `cpu_checkpointing` | | |
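
If the DeepSpeed config is built programmatically, the `activation_checkpointing` section can be merged into an existing config dict. This is a sketch with a hypothetical helper name; the keys are the ones shown in the JSON above. The validation rule is an assumption based on DeepSpeed's documentation describing `cpu_checkpointing` and `contiguous_memory_optimization` as operating on partitioned activations, so verify it against the DeepSpeed version in use.

```python
import json
from typing import Optional


def with_activation_checkpointing(ds_config: dict,
                                  partition_activations: bool = True,
                                  cpu_checkpointing: bool = False,
                                  contiguous_memory_optimization: bool = False,
                                  number_checkpoints: Optional[int] = None) -> dict:
    """Return a copy of ds_config with an activation_checkpointing section added."""
    # Assumed dependency: both options act on partitioned activations.
    if (cpu_checkpointing or contiguous_memory_optimization) and not partition_activations:
        raise ValueError("cpu_checkpointing and contiguous_memory_optimization "
                         "are used together with partition_activations")
    section = {
        "partition_activations": partition_activations,
        "cpu_checkpointing": cpu_checkpointing,
        "contiguous_memory_optimization": contiguous_memory_optimization,
    }
    if number_checkpoints is not None:
        section["number_checkpoints"] = number_checkpoints
    merged = dict(ds_config)  # shallow copy; the caller's dict is left untouched
    merged["activation_checkpointing"] = section
    return merged


if __name__ == "__main__":
    base = {"bf16": {"enabled": True}}
    cfg = with_activation_checkpointing(base, cpu_checkpointing=True,
                                        contiguous_memory_optimization=True,
                                        number_checkpoints=4)
    print(json.dumps(cfg, indent=2))
```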

Step 3: List Options

Based on the VRAM assessment, list all viable approaches. Example:

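```
Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
Qwen2.5-7B has the following training options:

Option A: LoRA fine-tuning (recommended)
- VRAM required: ~22 GB
- Speed: fast
- Quality: suitable for instruction alignment and style adaptation
- Trainable params: ~20M (0.4% of total)

Option B: QLoRA fine-tuning (saves VRAM)
- VRAM required: ~12 GB
- Speed: medium (quantize/dequantize overhead)
- Quality: slightly below LoRA, but the gap is small

Option C: Full fine-tuning (not feasible)
- VRAM required: ~56 GB (exceeds 24GB)
- Needs ZeRO-2 + CPU offload, or a larger GPU

Which option do you prefer?
```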

Step 4: Hardware Insufficient? Make Recommendations

If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):

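```
You want to fully fine-tune a 7B model, but the current hardware (single 24GB GPU) is insufficient.
Recommended hardware specs:

Minimum:
- GPU: single 80GB VRAM
- CPU: 16+ cores
- RAM: 128+ GB
- Disk: 200+ GB free

Recommended:
- GPU: 2 x 80GB VRAM (ZeRO-2 can double training speed)
- CPU: 32+ cores
- RAM: 256+ GB
- Disk: 500+ GB free

Alternatively, use LoRA: 24GB of VRAM is enough for a 7B model.
```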

Key Principles

- Never auto-select and start training: always list options and wait for user confirmation
- Recommend but don't decide: say "I recommend Option A because..." but let the user choose
- Use generic hardware metrics: VRAM in GB, GPU count, CPU cores, RAM in GB, disk in GB. No brand names.
- Leave VRAM headroom: recommend at least 20% buffer to avoid OOM
- If the user picks an infeasible option, warn them clearly rather than silently switching
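
The headroom and "warn, don't switch" principles can be expressed as a small classifier that labels each plan instead of choosing one. This is a sketch with hypothetical names; the 20% buffer is the figure from the list above, and the VRAM requirements in the usage example are the illustrative numbers from the Step 3 example (22, 12, and 56 GB on a 24 GB GPU).

```python
def classify_plan(required_vram_gb: float, gpu_vram_gb: float,
                  headroom: float = 0.20) -> str:
    """Label a plan 'ok', 'tight' (fits, but inside the buffer), or 'infeasible'."""
    if required_vram_gb > gpu_vram_gb:
        return "infeasible"
    if required_vram_gb > gpu_vram_gb * (1.0 - headroom):
        return "tight"
    return "ok"


if __name__ == "__main__":
    plans = {"LoRA": 22, "QLoRA": 12, "Full fine-tuning": 56}
    for name, need in plans.items():
        # Every plan is reported; the choice stays with the user.
        print(f"{name}: needs ~{need} GB -> {classify_plan(need, gpu_vram_gb=24)}")
```

Reporting a "tight" label rather than hiding the plan keeps the decision with the user while still making the OOM risk explicit.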

Core Capabilities

1. Training Configuration

Generate DeepSpeed ZeRO configurations:

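```python
from scripts.generate_ds_config import generate_zero_config

# ZeRO stage 2 with optimizer offload
config = generate_zero_config(
    zero_stage=2,
    offload_optimizer=True,
    offload_device="nvme",
    nvme_path="/local_nvme",
)
```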
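
The output of the config generator is not shown here, so the following is only a sketch of what a ZeRO stage 2 config with optimizer offload might contain, written as a plain dict. The function name is hypothetical. The sketch offloads to CPU because DeepSpeed's configuration reference describes NVMe offload under ZeRO stage 3; the `"auto"` values assume the Hugging Face Trainer integration, which fills them in from its own arguments.

```python
import json


def zero2_cpu_offload_config() -> dict:
    """Build an illustrative ZeRO-2 config with optimizer state offloaded to CPU."""
    return {
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        # Resolved by the HF Trainer integration (assumption: launched through it).
        "gradient_accumulation_steps": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


if __name__ == "__main__":
    # Write the config next to the script so it can be passed via --deepspeed.
    with open("ds_config_zero2_example.json", "w", encoding="utf-8") as handle:
        json.dump(zero2_cpu_offload_config(), handle, indent=2)
```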

2. Training Launch

Use the training launcher script:

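```bash
python scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --deepspeed assets/ds_config_zero2.json \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --lora_r 16 \
  --lora_alpha 32
```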

3. LoRA/QLoRA Integration

For parameter-efficient fine-tuning:

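```python
# The LoRA config is generated automatically from the arguments
peft_config = {
    "peft_type": "LORA",
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}
```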

4. Multi-GPU Training

Use the deepspeed launcher for multi-GPU training (recommended over torchrun):

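```bash
# Single node, multiple GPUs
deepspeed --num_gpus=4 scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...

# Multi-node
deepspeed --hostfile hosts.txt scripts/ds_train.py ...
```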
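
For multi-node runs the `deepspeed` launcher takes a hostfile via `--hostfile`, with one `hostname slots=N` line per node, where `slots` is the number of GPUs on that node. A minimal sketch for generating one; the hostnames `worker-1` and `worker-2` and the helper name are placeholders.

```python
from typing import Dict


def render_hostfile(slots_by_host: Dict[str, int]) -> str:
    """Render DeepSpeed hostfile text: one 'hostname slots=N' line per node."""
    lines = [f"{host} slots={slots}" for host, slots in slots_by_host.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder hostnames; each must be reachable over passwordless SSH.
    text = render_hostfile({"worker-1": 4, "worker-2": 4})
    with open("hosts.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
    print(text, end="")
```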

5. Training Monitoring

Monitor training progress:

CODEBLOCK8

6. Early Stopping

Automatically monitors eval loss and stops training early when there's no improvement across consecutive evaluations, then loads the best checkpoint.

Parameters:

- `--early_stopping_patience`: how many consecutive evals without improvement to tolerate. Set to 0 to disable (default). Recommended: 3-10.
- INLINECODE14: minimum eval loss improvement to count as an improvement. Default 0.0 (any decrease counts).

Example:

CODEBLOCK9

Auto-configuration: When early_stopping_patience > 0, the script automatically:

1. Enables INLINECODE16
2. Sets `metric_for_best_model=eval_loss`, INLINECODE18
3. Aligns `save_strategy` with `eval_strategy` (synced saving is needed to restore the best checkpoint)

Notes:

- You must also set `eval_strategy` (e.g., `steps` plus `eval_steps`); otherwise early stopping won't work
- Don't set patience too low (<3): early training fluctuations may cause premature stopping
- For LoRA fine-tuning, patience=5 with eval_steps=100 typically works well
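
The patience logic described in this section can be sketched in a few lines. This is an illustration of the rule (stop after N consecutive evals without sufficient improvement), not the training script's or Hugging Face's actual implementation; `EarlyStopper` and `min_delta` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EarlyStopper:
    patience: int            # consecutive non-improving evals tolerated; 0 disables
    min_delta: float = 0.0   # minimum decrease in eval loss that counts as improvement
    best: Optional[float] = None
    bad_evals: int = 0

    def should_stop(self, eval_loss: float) -> bool:
        """Record one evaluation and report whether training should stop."""
        if self.patience <= 0:
            return False
        if self.best is None or self.best - eval_loss > self.min_delta:
            self.best = eval_loss   # new best: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopper(patience=3)
    for step, loss in enumerate([1.0, 0.9, 0.95, 0.92, 0.91]):
        if stopper.should_stop(loss):
            print(f"stop at eval {step}; best eval loss was {stopper.best}")
            break
```

With a patience of 3, a single noisy evaluation does not end the run; only three non-improving evaluations in a row do, which is why very low patience values are risky early in training.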

Remote Training

When training needs to run on a remote GPU server, see references/remote_training.md for the complete guide including agent guidelines, security model, and command reference.

Troubleshooting

OOM Errors

- Reduce batch size or increase gradient accumulation steps
- Enable gradient checkpointing: `--gradient_checkpointing`
- Use ZeRO-3 with CPU/NVMe offloading
- Reduce LoRA rank: INLINECODE28
- See references/troubleshooting.md for detailed solutions

Slow Training

- Ensure bf16/fp16 is enabled
- Check GPU utilization with INLINECODE29
- Use FlashAttention if available
- Optimize data loading with INLINECODE30
- See references/troubleshooting.md for detailed solutions

Checkpoint Issues

- Use `--save_strategy steps` with INLINECODE32
- Enable `--save_total_limit` to cap checkpoint count
- For ZeRO-3, use `--zero3_save_16bit_model` to save 16-bit (fp16/bf16) weights
- See references/troubleshooting.md for detailed solutions

MPI Errors (multi-GPU only)

- Single-GPU training does not need MPI
- If you see MPI errors on a single GPU, run the script with `python3` directly instead of the `deepspeed` launcher
- See references/troubleshooting.md for the full MPI debugging guide

Single-GPU Strategy

- See references/singlegpu_strategy.md for strategy selection, CPU/NVMe offload examples, and decision principles

References

- Quick Start Guide: common training patterns and full examples
- DeepSpeed Guide: DeepSpeed documentation and configuration reference
- LoRA/PEFT Best Practices: LoRA/QLoRA parameter tuning guide
- ZeRO Optimization Guide: ZeRO stage comparison and optimization tips
- Single-GPU Strategy: strategy selection for single-GPU training
- Remote Training Guide: remote training via SSH, agent guidelines, and security model
- Troubleshooting: common errors and solutions (OOM, NaN loss, MPI, NCCL, etc.)

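DeepSpeed Fine-tuning Skill

This skill enables efficient model fine-tuning using DeepSpeed with various optimization strategies.

Prerequisites

- Python 3.8+
- GPU(s) or accelerator(s) with a DeepSpeed-supported backend (CUDA, ROCm, Intel XPU, etc.)
- DeepSpeed: `pip install deepspeed`
- Transformers, Datasets, PEFT (for LoRA support)
- sshpass: `sudo apt-get install sshpass` (for remote training)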

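Plan Selection Workflow

Never auto-select a plan. List viable options based on the user's hardware and requirements, and let the user decide.

Step 1: Gather Information

Confirm the following with the user:

- Target model: model name and parameter count (e.g., Qwen2.5-7B)
- Hardware environment:
  - GPU VRAM x count (e.g., single 24GB GPU)
  - CPU core count
  - RAM size
  - Free disk space
  - NVMe SSD availability (affects ZeRO NVMe offload)
- Training goal: full fine-tuning or parameter-efficient? Dataset size? Expected quality?
- Budget/time constraints: acceptable training duration?

If the user only provides an SSH or remote machine address, connect first and auto-detect hardware (nvidia-smi, free -h, df -h, nproc).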

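Step 2: Evaluate Feasibility

Estimate VRAM requirements based on model size (bf16):

| Params | Model Weights (bf16) | + Adam Optimizer + Gradients |
|---|---|---|
| 0.5B | ~1 GB | ~5 GB |
| 1.5B | ~3 GB | ~15 GB |
| 3B | ~6 GB | ~30 GB |
| 7B | ~14 GB | ~70 GB |
| 14B | ~28 GB | ~140 GB |
| 32B | ~64 GB | ~320 GB |
| 72B | ~144 GB | ~720 GB |

Breakdown: the Adam optimizer stores 2 fp32 state tensors (momentum + variance) = 8 bytes/param. Gradients = 2 bytes/param (bf16). Total approx. 10 bytes/param (5x model weight size).

Activation memory: depends on sequence length and batch size, not model params alone.

- Formula (per layer): `activation_bytes ≈ 34 × seq_len × hidden_size × batch_size × bytes_per_element`
- Example: 7B model (hidden=4096), seq_len=2048, batch_size=4, bf16 -> 34 × 2048 × 4096 × 4 × 2 bytes ≈ 2.3 GB per layer; ~73 GB total assuming 32 layers (can dominate VRAM)
- Gradient checkpointing reduces this by ~80% (recomputes instead of storing), but adds ~20% compute overhead
- Rule of thumb: if seq_len x batch_size > 8192, activation memory likely exceeds model weights

LoRA/QLoRA: VRAM depends on rank, target modules, and layer dimensions, and is not directly proportional to total model params. See references/lora_guide.md for LoRA-specific memory estimation.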

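Step 2.5: Activation Checkpointing

If VRAM is tight, activation checkpointing is the most impactful knob: it can reduce activation memory by ~80%.

How it works: instead of storing all intermediate activations for backprop, only save checkpoints at select layers. The remaining activations are recomputed during the backward pass. This trades compute for memory.

Two ways to enable:

1. HF Trainer flag (simplest, works out of the box):

```bash
python scripts/ds_train.py --gradient_checkpointing ...
```

2. DeepSpeed config (fine-grained control):

```json
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
  }
}
```

| Option | Effect | When to use |
|---|---|---|
| `partition_activations` | Shard checkpoints across model-parallel GPUs | Multi-GPU with model parallelism |
| `cpu_checkpointing` | | |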

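Step 3: List Options

Based on the VRAM assessment, list all viable approaches. Example:

```
Based on your hardware (single 24GB GPU, 64GB RAM, 500GB disk),
Qwen2.5-7B has the following training options:

Option A: LoRA fine-tuning (recommended)
- VRAM required: ~22 GB
- Speed: fast
- Quality: suitable for instruction alignment and style adaptation
- Trainable params: ~20M (0.4% of total)

Option B: QLoRA fine-tuning (saves VRAM)
- VRAM required: ~12 GB
- Speed: medium (quantize/dequantize overhead)
- Quality: slightly below LoRA, but the gap is small

Option C: Full fine-tuning (not feasible)
- VRAM required: ~56 GB (exceeds 24GB)
- Needs ZeRO-2 + CPU offload, or a larger GPU

Which option do you prefer?
```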

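Step 4: Hardware Insufficient? Make Recommendations

If no plan is viable on current hardware, recommend specs using generic hardware metrics (no brand names):

```
You want to fully fine-tune a 7B model, but the current hardware (single 24GB GPU) is insufficient.
Recommended hardware specs:

Minimum:
- GPU: single 80GB VRAM
- CPU: 16+ cores
- RAM: 128+ GB
- Disk: 200+ GB free

Recommended:
- GPU: 2 x 80GB VRAM (ZeRO-2 can double training speed)
- CPU: 32+ cores
- RAM: 256+ GB
- Disk: 500+ GB free

Alternatively, use LoRA: 24GB of VRAM is enough for a 7B model.
```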

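Key Principles

- Never auto-select and start training: always list options and wait for user confirmation
- Recommend but don't decide: say "I recommend Option A because..." but let the user choose
- Use generic hardware metrics: VRAM in GB, GPU count, CPU cores, RAM in GB, disk in GB. No brand names.
- Leave VRAM headroom: recommend at least 20% buffer to avoid OOM
- If the user picks an infeasible option, warn them clearly rather than silently switching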

Core Capabilities

1. Training Configuration

Generate DeepSpeed ZeRO configurations:

```python
from scripts.generate_ds_config import generate_zero_config

# ZeRO stage 2 with optimizer offload
config = generate_zero_config(
    zero_stage=2,
    offload_optimizer=True,
    offload_device="nvme",
    nvme_path="/local_nvme",
)
```

2. Training Launch

Use the training launcher script:

```bash
python scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/my_dataset \
  --output_dir ./outputs \
  --deepspeed assets/ds_config_zero2.json \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --learning_rate 2e-5 \
  --lora_r 16 \
  --lora_alpha 32
```

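3. LoRA/QLoRA Integration

For parameter-efficient fine-tuning:

```python
# The LoRA config is generated automatically from the arguments
peft_config = {
    "peft_type": "LORA",
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}
```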

4. Multi-GPU Training

Use the deepspeed launcher for multi-GPU training (recommended over torchrun):

```bash
# Single node, multiple GPUs
deepspeed --num_gpus=4 scripts/ds_train.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --deepspeed assets/ds_config_zero3.json \
  ...

# Multi-node
deepspeed --hostfile hosts.txt scripts/ds_train.py ...
```
