FL Plugin — Model Migration Skill
Usage
`/model-migrate-flagos <model_name> [upstream_folder] [plugin_folder]`
| Argument | Required | Default |
|---|---|---|
| `model_name` | Yes | none |
| `upstream_folder` | No | `/tmp/vllm-upstream-ref` |
| `plugin_folder` | No | current working directory |
Execution
Step 1: Parse arguments and validate paths
Extract from user input:
- `{{model_name}}` = first argument (required, snake_case)
- `{{upstream_folder}}` = second argument or `/tmp/vllm-upstream-ref`
- `{{plugin_folder}}` = third argument or current working directory
If {{upstream_folder}} doesn't exist, ask user whether to clone it. If {{plugin_folder}} doesn't exist, error out.
→ Tell user: Confirm parsed model name and paths.
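The parsing and validation rules in Step 1 can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the names `MigrationArgs` and `parse_migration_args` and the exact snake_case pattern are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from pathlib import Path

# Default taken from the argument table above.
DEFAULT_UPSTREAM = Path("/tmp/vllm-upstream-ref")


@dataclass
class MigrationArgs:  # hypothetical container, not part of the skill
    model_name: str
    upstream_folder: Path
    plugin_folder: Path
    upstream_missing: bool  # True means: ask the user whether to clone


def parse_migration_args(argv, cwd=None):
    """Parse `<model_name> [upstream_folder] [plugin_folder]`."""
    if not argv:
        raise ValueError("model_name is required")
    model_name = argv[0]
    # Assumed snake_case rule: lowercase words joined by single underscores.
    if not re.fullmatch(r"[a-z][a-z0-9]*(_[a-z0-9]+)*", model_name):
        raise ValueError(f"model_name must be snake_case, got {model_name!r}")
    upstream = Path(argv[1]) if len(argv) > 1 else DEFAULT_UPSTREAM
    plugin = Path(argv[2]) if len(argv) > 2 else Path(cwd or Path.cwd())
    # A missing plugin folder is a hard error.
    if not plugin.is_dir():
        raise FileNotFoundError(f"plugin folder does not exist: {plugin}")
    # A missing upstream folder is only flagged, so the caller can offer to clone.
    return MigrationArgs(model_name, upstream, plugin, not upstream.is_dir())
```

Returning a flag for the missing upstream folder, rather than raising, mirrors the asymmetry in the step: one case prompts the user, the other aborts.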
Step 2: Load references and resolve placeholders
Read these files (relative to this SKILL.md):
- `references/procedure.md`: step-by-step migration procedure
- `references/compatibility-patches.md`: 0.13.0 patch catalog
- `references/operational-rules.md`: communication, TaskList, bash rules, resilience
The procedure references executable scripts in scripts/:
- `scripts/validate_migration.py`: automated code review (Step 6)
- `scripts/benchmark.sh`: benchmark verification (Step 9)
- `scripts/serve.sh`: serve model locally (Step 10.1, also used for E2E)
- `scripts/request.sh`: test request (Step 10.2)
- `scripts/e2e_eval.py`: E2E correctness verification (Step 11)
- `scripts/e2e_test_prompts.json`: test prompts for E2E (5 text + 5 multimodal)
- `scripts/e2e_config.template.json`: E2E config template (copy to `e2e_config.json` and fill in)
- `scripts/e2e_remote_serve.sh`: manage GT server on remote machine via SSH
Then investigate upstream source + HuggingFace to resolve all placeholders:
| Placeholder | How to derive |
|---|---|
| `{{model_name}}` | Direct from argument |
| Lowercase model name | Lowercase of `model_name` (usually identical, e.g. `qwen3_5`); used in file paths |
| `{{MODEL_DISPLAY_NAME}}` | From upstream code or HF model card |
| `{{ModelClassName}}` | From upstream model class (PascalCase) |
| `{{model_type}}` | From HF `config.json` `model_type` field |
| `{{ConfigClassName}}` | From upstream or derive from `model_type` |
| `{{skill_root}}` | Absolute path to this skill's folder (the directory containing this SKILL.md) |
Naming conventions vary per model — always verify from actual source, never guess.
→ Tell user: Present all resolved values. Use AskUserQuestion if anything is ambiguous.
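As an illustration of Step 2, the sketch below reads `model_type` from a HuggingFace `config.json` and substitutes `{{name}}` tokens in a piece of text. The helper names `read_model_type` and `resolve_placeholders` are hypothetical; the skill itself does not prescribe how substitution is performed.

```python
import json
import re
from pathlib import Path

# Matches tokens such as {{model_type}} or {{ModelClassName}}.
_PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def read_model_type(config_path):
    """Return the `model_type` field of a HuggingFace config.json."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config["model_type"]


def resolve_placeholders(text, values):
    """Substitute {{name}} tokens; fail loudly on unresolved names."""
    missing = sorted(
        {name for name in _PLACEHOLDER.findall(text) if name not in values}
    )
    if missing:
        # Refusing to guess matches the "never guess" rule above.
        raise KeyError(f"unresolved placeholders: {missing}")
    return _PLACEHOLDER.sub(lambda m: values[m.group(1)], text)
```

Raising on an unresolved name, instead of leaving the token in place, surfaces ambiguity early so it can be put to the user.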
Step 3: Execute procedure
With placeholders resolved, execute every step in procedure.md sequentially. Apply patches from compatibility-patches.md during the copy-then-patch step. Follow operational-rules.md throughout.
→ Tell user: Before starting, output a numbered plan. Report progress at each step boundary.
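The Troubleshooting table describes patch P1 as converting relative imports to absolute ones. The real patch lives in compatibility-patches.md and is not reproduced here; the following is only a rough sketch of the idea. It assumes the copied file came from the upstream package `vllm.model_executor.models` and handles only single-dot `from .xxx import ...` lines. The function names are hypothetical.

```python
import re

# Assumption: the copied file lived in this upstream package.
UPSTREAM_PACKAGE = "vllm.model_executor.models"

# Single-dot relative imports only, e.g. `from .utils import helper`.
_RELATIVE_FROM = re.compile(r"^([ \t]*)from \.(\w[\w.]*) import ", re.MULTILINE)


def absolutize_imports(source, package=UPSTREAM_PACKAGE):
    """Rewrite `from .xxx import y` as `from <package>.xxx import y`."""
    return _RELATIVE_FROM.sub(
        lambda m: f"{m.group(1)}from {package}.{m.group(2)} import ", source
    )


def remaining_relative_imports(source):
    """Return lines that still begin with a relative import."""
    return [
        line for line in source.splitlines() if re.match(r"\s*from \.", line)
    ]
```

Anything the rewrite does not cover, such as `from ..layers import x` or `from . import y`, is reported by `remaining_relative_imports` so it can be fixed by hand.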
Scripts Reference
| Script | Step | Description |
|---|---|---|
| `validate_migration.py` | 6 | Automated import/API/registration checks |
| `benchmark.sh` | 9 | `vllm bench throughput` with dummy weights |
| `serve.sh` | 10, 11 | Start local vLLM server (port 8122, `VLLM_FL_PREFER_ENABLED=false`) |
| `request.sh` | 10 | Quick smoke-test request |
| `e2e_eval.py` | 11 | Token-level comparison vs upstream GT server |
| `e2e_test_prompts.json` | 11 | 5 text + 5 multimodal test prompts |
| `e2e_config.template.json` | 11 | Config template (GT machine, local port, eval params) |
| `e2e_remote_serve.sh` | 11 | SSH-based GT server lifecycle (start/stop/status/logs) |
Examples
Example 1: Typical new model
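User says: `/model-migrate-flagos kimi_k25`

Actions:
1. Parse: model_name=kimi_k25; upstream and plugin paths use defaults
2. Clone upstream, find `vllm/model_executor/models/kimi_k25.py`
3. Discover that it wraps DeepseekV2, so follow the kimi_k25 (wrapper) pattern
4. Copy the file, apply P1+P2 patches, create the config bridge
5. Register, validate, test, benchmark, serve + request
6. Run E2E verification against the upstream GT server

Result: kimi_k25 fully working in the plugin, all 11 steps passed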
Example 2: Re-run after upstream update
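User says: Re-migrate qwen3_5, upstream has been updated

Actions:
1. Idempotent re-run: overwrite existing files with the new upstream copy
2. Re-apply patches, re-validate, re-test
3. Re-run E2E to confirm no regressions

Result: qwen3_5 updated to the latest upstream version with no regressions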
Troubleshooting
General principle: When any runtime error occurs, first compare vLLM upstream code against both the plugin adaptation and the installed 0.13.0 environment. The diff is the fastest path to root cause. See operational-rules.md § Debugging Priority: Upstream-First for the full protocol.
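One way to carry out the upstream-first comparison is a plain unified diff between the upstream file and its plugin adaptation. The sketch below uses Python's standard `difflib`; the function name `diff_against_upstream` is hypothetical, and the full protocol in operational-rules.md may involve more than a textual diff.

```python
import difflib
from pathlib import Path


def diff_against_upstream(upstream_file, plugin_file):
    """Return a unified diff (as one string) from upstream to plugin."""
    upstream_lines = Path(upstream_file).read_text(encoding="utf-8").splitlines(
        keepends=True
    )
    plugin_lines = Path(plugin_file).read_text(encoding="utf-8").splitlines(
        keepends=True
    )
    return "".join(
        difflib.unified_diff(
            upstream_lines,
            plugin_lines,
            fromfile=str(upstream_file),
            tofile=str(plugin_file),
        )
    )
```

An empty string means the two files are identical. The same function can be pointed at the corresponding file in the installed 0.13.0 environment to cover the second half of the comparison.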
| Problem | Typical Cause | Fix |
|---|---|---|
| `ImportError` after copy-then-patch | Missing P1 fix (relative→absolute imports) | Verify all `from .xxx` converted to `from vllm.*` or `from vllm_fl.*` |
| `AttributeError: module vllm has no attribute X` | API doesn't exist in 0.13.0 | Check P3 in compatibility-patches.md; stub or remove |
| Config not recognized by vLLM | `model_type` mismatch or config bridge missing | Verify `_CONFIG_REGISTRY[model_type]` matches HF `config.json` exactly |
| Registration has no effect | Class name or import path typo | Compare with existing registrations in `__init__.py` |
| Benchmark `KeyError` on config field | Config bridge missing a field | Compare upstream config class vs bridge; add missing fields with defaults |
| Benchmark/Serve fails with OOM or "insufficient memory" | GPUs occupied by other processes | Kill GPU processes: `nvidia-smi --query-compute-apps=pid --format=csv,noheader \| xargs -r kill -9`, then retry. Never skip these steps. |
| Model outputs garbled/gibberish text | `ColumnParallelLinear` used for merged projections with different sub-dimensions (TP sharding mismatch) | Override `__init__` to use `MergedColumnParallelLinear(output_sizes=[...])`. See P8 in compatibility-patches.md |
| `AssertionError: Duplicate op name` | Child class imports custom op from different module path than parent | Use same import path as parent module (e.g. `vllm_fl.ops.fla` not `vllm_fl.models.fla_ops`). See P11 |
| `AttributeError` on `fused_recurrent_*` during CUDA graph warmup | `__init__` override with `nn.Module.__init__(self)` missed attributes used by inherited `_forward_core` | Create ALL attributes from parent's `__init__`, especially custom ops. See P12 |
| E2E: local server not reachable | `serve.sh` port doesn't match `e2e_config.json` local port | Ensure both use same port (default 8122) |
| E2E: GT server not reachable | GT machine down or docker/conda env wrong | Check `e2e_remote_serve.sh status` or SSH manually |
| E2E: early token divergence (first 5 tokens) | Weight loading bug, TP sharding error | Check `load_weights`, `stacked_params_mapping`, `MergedColumnParallelLinear` |
| E2E: late minor divergence (token #15+) | Numerical noise from different op implementations | Usually acceptable; document in report |
| `resolve_op` fails with `VLLM_FL_PREFER_ENABLED=false` | Op not registered in dispatch, no fallback | Add try/except fallback to `flag_gems` in op import code |
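The two E2E divergence rows above distinguish early divergence (first 5 tokens) from late divergence (token #15 onward). The sketch below shows one way to locate and bucket the first differing token. It is not the logic of `e2e_eval.py`, which is not shown in this document; the function names and the "unclassified" bucket for positions 6 to 14 are assumptions.

```python
def first_divergence(local_tokens, gt_tokens):
    """Return the 1-based position of the first differing token, or None."""
    for position, (a, b) in enumerate(zip(local_tokens, gt_tokens), start=1):
        if a != b:
            return position
    # Same prefix but different lengths: diverges right after the shorter one.
    if len(local_tokens) != len(gt_tokens):
        return min(len(local_tokens), len(gt_tokens)) + 1
    return None


def classify_divergence(position, early_limit=5, late_start=15):
    """Bucket a divergence position using the thresholds from the table."""
    if position is None:
        return "match"
    if position <= early_limit:
        return "early"  # suspect weight loading or TP sharding
    if position >= late_start:
        return "late"  # usually numerical noise; document in report
    return "unclassified"  # the table gives no guidance for this range
```

For example, `classify_divergence(first_divergence([1, 2, 3], [1, 9, 3]))` yields `"early"`, which points toward the `load_weights` and sharding checks listed in the table.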