[2602.17951] ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
Summary
The paper presents ROCKET, a framework that improves the 3D spatial understanding of Vision-Language-Action (VLA) models through residual-oriented multi-layer representation alignment, while also reducing compute cost.
Why It Matters
This research addresses a limitation of current VLA models: they are typically pretrained on 2D data and lack 3D spatial understanding. The proposed method improves performance while reducing compute requirements, which makes it relevant to robotics and AI applications.
Key Takeaways
- ROCKET uses a shared projector for multi-layer alignment, which reduces gradient conflicts.
- The framework achieves a 98.5% success rate on LIBERO with only 4% of the typical compute budget.
- Empirical results demonstrate ROCKET's superior performance across various VLA models and datasets.
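
The shared-projector idea in the first takeaway can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear projector, the cosine-based loss, the one-to-one layer pairing, and all names and dimensions (`multi_layer_alignment_loss`, `d_vla`, `d_teacher`) are hypothetical choices made here for clarity.

```python
import numpy as np

def cosine_alignment_loss(student, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) over tokens; both inputs are (tokens, dim)."""
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def multi_layer_alignment_loss(vla_layers, teacher_layers, projector):
    """Average alignment loss over paired layers.

    The same projector matrix is applied at every layer pair (assumption:
    a plain linear map), so all layers share one layer-invariant mapping
    instead of each layer training its own projection head.
    """
    assert len(vla_layers) == len(teacher_layers)
    losses = [
        cosine_alignment_loss(h @ projector, z)
        for h, z in zip(vla_layers, teacher_layers)
    ]
    return sum(losses) / len(losses)

# Toy data: random stand-ins for hidden states (hypothetical sizes).
rng = np.random.default_rng(0)
tokens, d_vla, d_teacher, n_pairs = 16, 32, 24, 3
projector = rng.normal(size=(d_vla, d_teacher)) / np.sqrt(d_vla)
vla_layers = [rng.normal(size=(tokens, d_vla)) for _ in range(n_pairs)]
teacher_layers = [rng.normal(size=(tokens, d_teacher)) for _ in range(n_pairs)]

loss = multi_layer_alignment_loss(vla_layers, teacher_layers, projector)

# Sanity check: if the teacher features equal the projected student
# features, the alignment loss should be (numerically) zero.
perfect = multi_layer_alignment_loss(
    vla_layers, [h @ projector for h in vla_layers], projector
)
```

In a real training setup the projector would be a learned module and this term would be added to the policy's action loss; the point of the sketch is only that one set of projector weights receives gradients from every aligned layer pair.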
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.17951 (cs)
[Submitted on 20 Feb 2026]
Title: ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
Authors: Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, Ang Li
Abstract: Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses sh...