Robotics VLA Skill
Expert guidance for building generalist robot policies using Vision-Language-Action (VLA) flow models, based on the π0 architecture.
Core Architecture
π0 model = VLM backbone + action expert + flow matching
| Component | Detail |
|---|---|
| VLM backbone | PaliGemma (3B): provides visual + language understanding |
| Action expert | Separate transformer weights (~300M) for robot state + actions |
| Total params | ~3.3B |
| Action output | Chunks of H=50 actions; 50Hz or 20Hz robots |
| Inference speed | ~73ms on RTX 4090 |
See references/architecture.md for full technical details (attention masks, flow matching math, MoE design).
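The chunk length and inference latency in the table imply a simple timing budget. The sketch below only works through that arithmetic. The function name `chunk_budget` is illustrative, and it assumes the whole chunk is executed before the next inference call, which the table itself does not state.

```python
def chunk_budget(horizon: int, control_hz: float, inference_ms: float) -> dict:
    """Timing implied by one action chunk at a given control rate."""
    chunk_seconds = horizon / control_hz          # wall-clock time one chunk covers
    inference_seconds = inference_ms / 1000.0
    return {
        "chunk_seconds": chunk_seconds,
        # Share of the chunk duration spent waiting on one forward pass.
        "inference_fraction": inference_seconds / chunk_seconds,
        # Control steps that elapse while one inference call runs.
        "steps_during_inference": inference_seconds * control_hz,
    }


# Figures from the table: H=50, ~73 ms per inference call.
for hz in (50, 20):
    budget = chunk_budget(horizon=50, control_hz=hz, inference_ms=73)
    print(hz, "Hz:", budget)
```

At 50Hz a full chunk covers one second of motion, so a single inference call takes a small fraction of the time the chunk buys.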
Training Pipeline
Two-phase approach (mirrors LLM training):
1. Pre-training → broad physical capabilities + recovery behaviors across many tasks/robots
2. Fine-tuning → fluent, task-specific execution on target task
Key rule: combining both phases outperforms either alone. Pre-training gives robustness; fine-tuning gives precision.
See references/training.md for data mixture ratios, loss functions, and fine-tuning dataset sizing.
Action Representation
Use flow matching, not autoregressive discretization.
- Flow matching models continuous action distributions → essential for high-frequency dexterous control
- Autoregressive token prediction (e.g. RT-2 style) cannot produce action chunks efficiently
- Action chunks allow open-loop execution at 50Hz without temporal ensembling
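The points above can be made concrete with a minimal, generic flow matching sketch in plain Python. It assumes a straight-line path from noise (tau=0) to data (tau=1) and a 10-step Euler sampler. These are common choices for illustration, not a statement of π0's exact conventions, which live in references/architecture.md. All function names are hypothetical, and `oracle_velocity` is a toy stand-in for a trained network.

```python
import random


def interpolate(action, noise, tau):
    """Noisy action on the straight path from noise (tau=0) to data (tau=1)."""
    return [tau * a + (1.0 - tau) * e for a, e in zip(action, noise)]


def target_velocity(action, noise):
    """Regression target: derivative of the path above with respect to tau."""
    return [a - e for a, e in zip(action, noise)]


def flow_matching_loss(predicted, action, noise):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(action, noise)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def sample_chunk(velocity_fn, dim, steps=10, rng=None):
    """Euler-integrate a velocity field from Gaussian noise to an action chunk."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    delta = 1.0 / steps
    for k in range(steps):
        tau = k / steps  # never reaches 1.0, so the oracle below stays finite
        v = velocity_fn(x, tau)
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target: a 50-step chunk of a single scalar action.
TARGET = [0.1 * i for i in range(50)]


def oracle_velocity(x, tau):
    """Exact velocity field when the data distribution is the single chunk TARGET."""
    return [(t - xi) / (1.0 - tau) for t, xi in zip(TARGET, x)]


chunk = sample_chunk(oracle_velocity, dim=len(TARGET))
```

The whole chunk comes out of a fixed number of integration steps, regardless of how many actions it contains. A token-by-token decoder needs one decoding step per action token instead.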
Multi-Embodiment Support
Single model handles 7+ robot configurations via:
- Zero-padding smaller action spaces to match the largest (17-dim)
- Shared VLM backbone; embodiment-specific behavior learned via data
- Weighted task sampling: each task-robot combination is weighted by n^0.43, where n is its sample count, so over-represented combinations are down-weighted
See references/embodiments.md for robot platform specs and action space details.
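A minimal sketch of the zero-padding and n^0.43 weighting described above. The helper names and the sample counts are hypothetical and chosen only to show the effect. The 17-dim limit and the 0.43 exponent come from the text.

```python
MAX_ACTION_DIM = 17  # largest action space, per the list above


def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Right-pad a smaller robot's action vector with zeros up to max_dim."""
    if len(action) > max_dim:
        raise ValueError(f"action has {len(action)} dims, max is {max_dim}")
    return list(action) + [0.0] * (max_dim - len(action))


def sampling_weights(sample_counts, exponent=0.43):
    """Normalized sampling weights proportional to n ** exponent."""
    raw = {name: n ** exponent for name, n in sample_counts.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


# Hypothetical counts: one combination has 100x more data than the other.
counts = {"robot_a/task_1": 1_000_000, "robot_b/task_2": 10_000}
weights = sampling_weights(counts)
raw_share = counts["robot_a/task_1"] / sum(counts.values())

print("raw share of robot_a/task_1:     ", round(raw_share, 3))
print("weighted share of robot_a/task_1:", round(weights["robot_a/task_1"], 3))

padded = pad_action([0.1] * 7)  # e.g. a 7-dim single-arm action
print(len(padded), padded[7:])
```

An exponent below 1 compresses the gap between large and small datasets. The larger combination is still sampled more often, but no longer in proportion to its raw size.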
High-Level Policy Integration
For long-horizon tasks, use a two-tier approach:
- High-level VLM: decomposes task ("bus the table") → subtasks ("pick up napkin")
- Low-level π0: executes each subtask as a language-conditioned action sequence
Analogous to SayCan. Intermediate language commands significantly boost performance vs. flat task descriptions.
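The two-tier loop can be sketched as below. Everything here is hypothetical scaffolding: `run_task`, `scripted_planner`, and `always_succeeds` are illustrative names, and the planner and policy are stubs standing in for a real high-level VLM and a real low-level policy. The second subtask string is made up for the example.

```python
from typing import Callable, List, Optional


def run_task(
    task: str,
    next_subtask: Callable[[str, List[str]], Optional[str]],
    execute: Callable[[str], bool],
    max_subtasks: int = 20,
) -> List[str]:
    """Alternate between a high-level planner and a low-level policy.

    next_subtask returns the next language command, or None when the task is done.
    execute runs one language-conditioned subtask and reports success.
    """
    completed: List[str] = []
    for _ in range(max_subtasks):
        subtask = next_subtask(task, completed)
        if subtask is None:
            break
        if execute(subtask):
            completed.append(subtask)
        # On failure the planner is queried again with the same history,
        # so it can retry or choose a different subtask.
    return completed


# Stub planner: walks through a fixed, hypothetical plan.
PLAN = ["pick up napkin", "put napkin in trash"]


def scripted_planner(task: str, completed: List[str]) -> Optional[str]:
    remaining = [s for s in PLAN if s not in completed]
    return remaining[0] if remaining else None


def always_succeeds(subtask: str) -> bool:
    return True


done = run_task("bus the table", scripted_planner, always_succeeds)
print(done)
```

Keeping the planner and the policy behind plain callables means either tier can be swapped, for example replacing the planner with human-provided intermediate commands.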
Related & Complementary Research (2025)
π0 has been extended and complemented by several key works. See references/related-work.md for the full landscape, including:
- π0-FAST / π0.5 / π0.6: direct successors with faster training, open-world generalization, and RL fine-tuning
- RTC: async action chunking to eliminate inference pauses (plug-in, no retraining)
- UniVLA: unsupervised action extraction from raw video (no action labels needed)
- ManiFlow / Streaming Flow: smoother action generation
- GR00T N1, Helix, OpenVLA-OFT, DiVLA, RDT-1B: parallel approaches from NVIDIA, Figure AI, and academia
Evaluation Checklist
When evaluating a robot manipulation policy:
- [ ] Out-of-box generalization (no fine-tuning) vs. baselines
- [ ] Language following accuracy with flat / human-guided / HL commands
- [ ] Fine-tuning efficiency (success rate vs. hours of data)
- [ ] Complex multi-stage tasks (5–20 min, recovery from failure)
- [ ] Compare: OpenVLA, Octo, ACT, Diffusion Policy as baselines
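Most checklist items reduce to comparing success rates across policies and conditions. The sketch below shows one way to tabulate them. The function name is hypothetical, and the trial log is placeholder data with made-up outcomes, not measured results for any policy.

```python
from collections import defaultdict


def success_rates(trials):
    """Aggregate (policy, condition, succeeded) records into success rates."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for policy, condition, succeeded in trials:
        counts[(policy, condition)][0] += int(succeeded)
        counts[(policy, condition)][1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


# Placeholder log: outcomes are invented purely to exercise the function.
trial_log = [
    ("policy_a", "flat", True),
    ("policy_a", "flat", False),
    ("policy_a", "high_level", True),
    ("policy_a", "high_level", True),
    ("baseline_b", "flat", False),
    ("baseline_b", "flat", False),
]

rates = success_rates(trial_log)
for (policy, condition), rate in sorted(rates.items()):
    print(f"{policy:12s} {condition:12s} {rate:.2f}")
```

The same record shape extends to the fine-tuning efficiency item by using hours of data as the condition.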