When to Use
User wants to fine-tune a language model, evaluate if fine-tuning is worth it, or debug training issues.
Quick Reference
| Topic | File |
|---|
| Provider comparison & pricing | INLINECODE0 |
| Data preparation & validation |
data-prep.md |
| Training configuration |
training.md |
| Evaluation & debugging |
evaluation.md |
| Cost estimation & ROI |
costs.md |
| Compliance & security |
compliance.md |
Core Capabilities
- 1. Decide fit — Analyze if fine-tuning beats prompting for the use case
- Prepare data — Convert raw data to JSONL, deduplicate, validate format
- Select provider — Compare OpenAI, Anthropic (Bedrock), Google, open source based on constraints
- Estimate costs — Calculate training cost, inference savings, break-even point
- Configure training — Set hyperparameters (learning rate, epochs, LoRA rank)
- Run evaluation — Compare fine-tuned vs base model on task-specific metrics
- Debug failures — Diagnose loss curves, overfitting, catastrophic forgetting
- Handle compliance — Scan for PII, configure on-premise training, generate audit logs
Decision Checklist
Before recommending fine-tuning, ask:
- - [ ] What's the failure mode with prompting? (format, style, knowledge, cost)
- [ ] How many training examples available? (minimum 50-100)
- [ ] Expected inference volume? (affects ROI calculation)
- [ ] Privacy constraints? (determines provider options)
- [ ] Budget for training + ongoing inference?
Fine-Tune vs Prompt Decision
| Signal | Recommendation |
|---|
| Format/style inconsistency | Fine-tune ✓ |
| Missing domain knowledge |
RAG first, then fine-tune if needed |
| High inference volume (>100K/mo) | Fine-tune for cost savings |
| Requirements change frequently | Stick with prompting |
| <50 quality examples | Prompting + few-shot |
Critical Rules
- - Data quality > quantity — 100 great examples beat 1000 noisy ones
- LoRA first — Never jump to full fine-tuning; LoRA is 10-100x cheaper
- Hold out eval set — Always 80/10/10 split; never peek at test data
- Same precision — Train and serve at identical precision (4-bit, 16-bit)
- Baseline first — Run eval on base model before training to measure actual improvement
- Expect iteration — First attempt rarely optimal; plan for 2-3 cycles
Common Pitfalls
| Mistake | Fix |
|---|
| Training on inconsistent data | Manual review of 100+ samples before training |
| Learning rate too high |
Start with 2e-4 for SFT, 5e-6 for RLHF |
| Expecting new knowledge | Fine-tuning adjusts behavior, not knowledge — use RAG |
| No baseline comparison | Always test base model on same eval set |
| Ignoring forgetting | Mix 20% general data to preserve capabilities |
技能名称:微调
适用场景
用户想要微调语言模型、评估微调是否值得,或调试训练问题。
快速参考
| 主题 | 文件 |
|---|
| 供应商对比与定价 | providers.md |
| 数据准备与验证 |
data-prep.md |
| 训练配置 | training.md |
| 评估与调试 | evaluation.md |
| 成本估算与投资回报率 | costs.md |
| 合规与安全 | compliance.md |
核心能力
- 1. 决策适配 — 分析微调是否比提示工程更适合当前用例
- 数据准备 — 将原始数据转换为JSONL格式、去重、验证格式
- 供应商选择 — 根据约束条件对比OpenAI、Anthropic(Bedrock)、Google、开源方案
- 成本估算 — 计算训练成本、推理节省、盈亏平衡点
- 训练配置 — 设置超参数(学习率、训练轮数、LoRA秩)
- 运行评估 — 在任务特定指标上对比微调模型与基础模型
- 故障调试 — 诊断损失曲线、过拟合、灾难性遗忘
- 合规处理 — 扫描个人身份信息、配置本地训练、生成审计日志
决策检查清单
在推荐微调之前,请确认:
- - [ ] 提示工程的失败模式是什么?(格式、风格、知识、成本)
- [ ] 有多少训练样本可用?(最少50-100个)
- [ ] 预期推理量是多少?(影响投资回报率计算)
- [ ] 隐私约束有哪些?(决定供应商选项)
- [ ] 训练和持续推理的预算?
微调与提示决策对比
先使用RAG,必要时再微调 |
| 高推理量(每月超过10万次) | 微调以节省成本 |
| 需求频繁变化 | 坚持使用提示工程 |
| 少于50个高质量样本 | 提示工程+少样本学习 |
关键规则
- - 数据质量重于数量 — 100个优质样本胜过1000个噪声样本
- 优先使用LoRA — 切勿直接进行全参数微调;LoRA成本低10-100倍
- 保留评估集 — 始终按80/10/10比例划分;切勿窥探测试数据
- 保持相同精度 — 训练和推理使用相同精度(4位、16位)
- 先建立基线 — 在训练前对基础模型进行评估,以衡量实际改进
- 预期迭代 — 首次尝试很少达到最优;计划2-3轮迭代
常见陷阱
| 错误 | 修复方法 |
|---|
| 在不一致的数据上训练 | 训练前人工审核100个以上样本 |
| 学习率过高 |
监督微调从2e-4开始,强化学习从5e-6开始 |
| 期望获得新知识 | 微调调整行为而非知识——使用RAG |
| 未进行基线对比 | 始终在相同评估集上测试基础模型 |
| 忽略遗忘问题 | 混合20%通用数据以保留能力 |