[2602.19710] Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
Summary
The paper presents Pose-VLA, a framework for Vision-Language-Action (VLA) models that decouples training into separate pre-training and post-training phases to improve training efficiency and the generalization of robot action policies.
Why It Matters
This research addresses critical limitations of existing VLA models, particularly their low training efficiency and limited generalization across diverse tasks. By introducing a structured pre-training approach, it offers a pathway to more capable and adaptable robot policies, which is essential for real-world applications in robotics and AI.
Key Takeaways
- Pose-VLA decouples VLA training into pre-training and post-training phases for improved efficiency.
- The framework uses discrete pose tokens for universal representation, enhancing spatial grounding.
- Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 and competitive performance on LIBERO.
- Real-world experiments confirm robust generalization from only a small number of demonstrations per task.
- The proposed method addresses feature collapse and low training efficiency in existing models.
Subjects: Computer Science > Computer Vision and Pattern Recognition (cs)
arXiv:2602.19710 [Submitted on 23 Feb 2026]
Title: Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
Authors: Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu
Abstract: Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spat...
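To make the "discrete pose tokens" idea concrete, here is a minimal illustrative sketch of how a continuous 6-DoF camera-frame pose could be discretized into integer tokens by uniform binning. This is an assumption for illustration only: the bin count, workspace bounds, and Euler-angle parameterization below are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical tokenization settings (not the paper's actual scheme).
NUM_BINS = 256
POS_RANGE = (-1.0, 1.0)      # metres, per translation axis (assumed)
ROT_RANGE = (-np.pi, np.pi)  # radians, per Euler angle (assumed)

def to_tokens(values, lo, hi, num_bins=NUM_BINS):
    """Map continuous values in [lo, hi] to integer bin indices in [0, num_bins)."""
    values = np.clip(np.asarray(values, dtype=float), lo, hi)
    frac = (values - lo) / (hi - lo)                      # normalize to [0, 1]
    return np.minimum((frac * num_bins).astype(int), num_bins - 1)

def pose_to_tokens(xyz, rpy):
    """Concatenate translation and rotation bin indices into one token sequence."""
    return np.concatenate([
        to_tokens(xyz, *POS_RANGE),   # 3 translation tokens
        to_tokens(rpy, *ROT_RANGE),   # 3 rotation tokens
    ])

tokens = pose_to_tokens([0.10, -0.25, 0.40], [0.0, 1.57, -0.78])
print(tokens)  # six integers, each in [0, 255]
```

Representing poses this way turns spatial supervision into a sequence-prediction problem over a shared, embodiment-agnostic vocabulary, which is what lets heterogeneous 3D datasets and robot demonstrations share one pre-training objective.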