[2602.19710] Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
Summary
The paper presents Pose-VLA, a framework for Vision-Language-Action (VLA) models that decouples training into separate pre-training and post-training phases to improve training efficiency and the generalization of robot action policies.
Why It Matters
This research addresses critical limitations of existing VLA models, particularly their low training efficiency and limited generalization across diverse tasks. By introducing a structured pre-training approach, it offers a pathway to more capable and adaptable robot policies, which is essential for real-world applications in robotics and AI.
Key Takeaways
- Pose-VLA decouples VLA training into pre-training and post-training phases for improved efficiency.
- The framework uses discrete pose tokens for universal representation, enhancing spatial grounding.
- Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 and competitive performance on LIBERO.
- Real-world experiments confirm robust generalization from only a small number of demonstrations per task.
- The proposed method addresses feature collapse and low training efficiency in existing models.
Subjects: Computer Science > Computer Vision and Pattern Recognition (cs)
arXiv:2602.19710 [Submitted on 23 Feb 2026]
Title: Universal Pose Pretraining for Generalizable Vision-Language-Action Policies
Authors: Haitao Lin, Hanyang Yu, Jingshun Huang, He Zhang, Yonggen Ling, Ping Tan, Xiangyang Xue, Yanwei Fu
Abstract: Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spat...
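To make the "discrete pose tokens" idea concrete, here is a minimal illustrative sketch of how a continuous 6-DoF camera-frame pose could be discretized into integer tokens by uniform binning. This is an assumption for illustration only: the bin count, workspace bounds, and Euler-angle parameterization below are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical tokenization settings (not the paper's actual scheme).
NUM_BINS = 256
POS_RANGE = (-1.0, 1.0)      # metres, per translation axis (assumed)
ROT_RANGE = (-np.pi, np.pi)  # radians, per Euler angle (assumed)

def to_tokens(values, lo, hi, num_bins=NUM_BINS):
    """Map continuous values in [lo, hi] to integer bin indices in [0, num_bins)."""
    values = np.clip(np.asarray(values, dtype=float), lo, hi)
    frac = (values - lo) / (hi - lo)                      # normalize to [0, 1]
    return np.minimum((frac * num_bins).astype(int), num_bins - 1)

def pose_to_tokens(xyz, rpy):
    """Concatenate translation and rotation bin indices into one token sequence."""
    return np.concatenate([
        to_tokens(xyz, *POS_RANGE),   # 3 translation tokens
        to_tokens(rpy, *ROT_RANGE),   # 3 rotation tokens
    ])

tokens = pose_to_tokens([0.10, -0.25, 0.40], [0.0, 1.57, -0.78])
print(tokens)  # six integers, each in [0, 255]
```

Representing poses this way turns spatial supervision into a sequence-prediction problem over a shared, embodiment-agnostic vocabulary, which is what lets heterogeneous 3D datasets and robot demonstrations share one pre-training objective.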