Robotics VLA Skill

Expert guidance for building generalist robot policies using Vision-Language-Action (VLA) flow models, based on the π0 architecture.

Core Architecture

π0 model = VLM backbone + action expert + flow matching

Component	Detail
VLM backbone	PaliGemma (3B) — provides visual + language understanding
Action expert

See references/architecture.md for full technical details (attention masks, flow matching math, MoE design).

Training Pipeline

Two-phase approach (mirrors LLM training):

1. Pre-training → broad physical capabilities + recovery behaviors across many tasks/robots
2. Fine-tuning → fluent, task-specific execution on the target task

Key rule: combining both phases outperforms either alone. Pre-training gives robustness; fine-tuning gives precision.

See references/training.md for data mixture ratios, loss functions, and fine-tuning dataset sizing.

Action Representation

Use flow matching, not autoregressive discretization.

- Flow matching models continuous action distributions → essential for high-frequency dexterous control
- Autoregressive token prediction (e.g. RT-2 style) cannot produce action chunks efficiently
- Action chunks allow open-loop execution at 50 Hz without temporal ensembling
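
As a rough illustration of the points above, the sketch below shows the two halves of flow matching on a flattened action chunk: how a training pair is built, and how a chunk is sampled by Euler integration from noise. It assumes the common linear-interpolation convention (noise at tau = 0, data at tau = 1), which may differ in sign or parameterization from the exact π0 formulation. `toy_velocity`, the 10 integration steps, and the chunk sizes are hypothetical stand-ins; a real action expert would be a neural network conditioned on images, language, and robot state.

```python
import random


def make_training_pair(action, noise, tau):
    """Build the noisy input x_tau and the velocity target for one sample.

    Assumed convention: x_tau = (1 - tau) * noise + tau * action,
    so the regression target is the constant velocity (action - noise).
    """
    x_tau = [(1.0 - tau) * e + tau * a for a, e in zip(action, noise)]
    target_velocity = [a - e for a, e in zip(action, noise)]
    return x_tau, target_velocity


def toy_velocity(x, tau, target):
    """Hypothetical stand-in for the learned velocity field.

    For a dataset containing a single target chunk, the exact field is
    (target - x) / (1 - tau).
    """
    return [(t - xi) / (1.0 - tau) for xi, t in zip(x, target)]


def sample_action_chunk(target, num_steps=10, seed=0):
    """Euler-integrate the velocity field from tau = 0 (noise) toward tau = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    delta = 1.0 / num_steps
    for k in range(num_steps):
        v = toy_velocity(x, k * delta, target)
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


# Flattened chunk: horizon 4 x action_dim 2 (hypothetical sizes).
target_chunk = [0.1, -0.2, 0.3, 0.0, 0.5, 0.5, -0.4, 0.2]
chunk = sample_action_chunk(target_chunk)
```

The whole chunk is produced by a fixed number of integration steps regardless of its length, which is the property that makes chunked, high-frequency control practical compared with emitting one discretized token at a time.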

Multi-Embodiment Support

Single model handles 7+ robot configurations via:

- Zero-padding smaller action spaces to match the largest (17-dim)
- Shared VLM backbone; embodiment-specific behavior learned via data
- Weighted task sampling: n^0.43 to handle imbalanced data across robot types
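
A minimal sketch of the padding and sampling ideas in the list above, assuming n is the number of samples available for a given task or robot combination. The dataset names and counts are hypothetical placeholders, and the helper names are illustrative rather than part of any published codebase.

```python
MAX_ACTION_DIM = 17  # largest action space, as stated in the text above


def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Zero-pad a robot's action vector up to the shared dimensionality."""
    if len(action) > max_dim:
        raise ValueError(f"action has {len(action)} dims, max is {max_dim}")
    return list(action) + [0.0] * (max_dim - len(action))


def sampling_weights(sample_counts, exponent=0.43):
    """Turn per-dataset sample counts into normalized sampling weights.

    Raising counts to a power below 1 down-weights over-represented
    datasets relative to sampling in proportion to their size.
    """
    raw = {name: n ** exponent for name, n in sample_counts.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


# Hypothetical counts: one dataset is 100x larger than the other.
counts = {"single_arm": 1_000_000, "bimanual": 10_000}
weights = sampling_weights(counts)

# A 7-dim single-arm action padded to the shared 17-dim space.
padded = pad_action([0.1] * 7)
```

With these placeholder counts, the larger dataset is sampled roughly 7 times as often as the smaller one instead of 100 times, so rare embodiments still contribute meaningfully to each batch.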

See references/embodiments.md for robot platform specs and action space details.

High-Level Policy Integration

For long-horizon tasks, use a two-tier approach:

- High-level VLM: decomposes task ("bus the table") → subtasks ("pick up napkin")
- Low-level π0: executes each subtask as a language-conditioned action sequence

Analogous to SayCan. Intermediate language commands significantly boost performance vs. flat task descriptions.
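
The two-tier control flow can be sketched as a simple loop. Everything here is a hypothetical stand-in: `fake_planner` plays the role of the high-level VLM, `fake_policy` plays the role of the low-level π0 policy, and the second subtask string is invented for illustration. A real system would replan from fresh observations rather than follow a fixed list.

```python
from typing import Callable, List, Tuple


def run_long_horizon_task(
    task: str,
    plan_subtasks: Callable[[str], List[str]],
    execute_subtask: Callable[[str], bool],
) -> Tuple[bool, List[str]]:
    """Run subtasks in order and stop at the first failure.

    Returns (overall_success, list_of_completed_subtasks).
    """
    completed: List[str] = []
    for subtask in plan_subtasks(task):
        if not execute_subtask(subtask):
            return False, completed
        completed.append(subtask)
    return True, completed


def fake_planner(task: str) -> List[str]:
    """Hypothetical high-level policy: maps a task to language subtasks."""
    plans = {"bus the table": ["pick up napkin", "put napkin in trash"]}
    return plans.get(task, [task])  # unknown tasks pass through as-is


def fake_policy(subtask: str) -> bool:
    """Hypothetical low-level policy: pretends every subtask succeeds."""
    return True


success, done = run_long_horizon_task("bus the table", fake_planner, fake_policy)
```

Passing an unknown task through unchanged corresponds to the flat-command baseline, which makes it easy to compare both modes with the same harness.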

Related & Complementary Research (2025)

π0 has been extended and complemented by several key works. See references/related-work.md for the full landscape, including:

- π0-FAST / π0.5 / π0.6: direct successors with faster training, open-world generalization, and RL fine-tuning
- RTC: asynchronous action chunking to eliminate inference pauses (plug-in, no retraining)
- UniVLA: unsupervised action extraction from raw video (no action labels needed)
- ManiFlow / Streaming Flow: smoother action generation
- GR00T N1, Helix, OpenVLA-OFT, DiVLA, RDT-1B: parallel approaches from NVIDIA, Figure AI, and academia

Evaluation Checklist

When evaluating a robot manipulation policy:

- [ ] Out-of-box generalization (no fine-tuning) vs. baselines
- [ ] Language-following accuracy with flat / human-guided / high-level (HL) policy commands
- [ ] Fine-tuning efficiency (success rate vs. hours of data)
- [ ] Complex multi-stage tasks (5–20 min, recovery from failure)
- [ ] Compare against OpenVLA, Octo, ACT, and Diffusion Policy as baselines
