[2602.12281] Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Summary
This paper argues that scaling test-time verification can be more effective than scaling policy learning for Vision-Language-Action (VLA) alignment, and presents a contrastive verifier framework that substantially improves action accuracy.
Why It Matters
The research addresses a critical challenge in robotics and AI: aligning generated actions with natural language instructions. By demonstrating that verification can outperform traditional policy learning, this work has implications for developing more reliable and efficient robotic systems capable of understanding and executing complex tasks.
Key Takeaways
- Test-time verification can reduce the intention-action gap in VLA models.
- The proposed CoVer architecture significantly enhances action selection efficiency.
- Joint scaling of instructions and actions yields better performance than independent scaling.
- Real-world experiments show a 45% improvement in action accuracy with the verification approach.
- CoVer-VLA achieves notable gains in task success rates across various benchmarks.
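The joint-scaling recipe in the takeaways (sample several rephrased instructions, sample several actions per phrasing, then let a verifier pick the best candidate) can be sketched as follows. This is a toy illustration under stated assumptions: `rephrase`, `sample_action`, and `verifier_score` are hypothetical stand-ins, not the paper's actual VLA policy or CoVer verifier.

```python
import random

random.seed(0)  # deterministic toy example

def rephrase(instruction, n):
    """Stand-in for an LLM producing n paraphrases of the instruction."""
    return [f"{instruction} (rephrasing {i})" for i in range(n)]

def sample_action(instruction):
    """Stand-in for a VLA policy sampling one action given an instruction.
    Here an action is just a 3-dimensional vector of random values."""
    return [random.uniform(-1, 1) for _ in range(3)]

def verifier_score(instruction, action):
    """Stand-in for a verifier scoring instruction-action alignment.
    Toy heuristic: prefer small-magnitude actions."""
    return -sum(a * a for a in action)

def best_of_joint(instruction, n_rephrasings=4, n_actions=4):
    """Jointly scale both test-time dimensions: for each rephrased
    instruction, draw several candidate actions, score every
    (instruction, action) pair, and return the top-scoring action."""
    candidates = []
    for phrasing in rephrase(instruction, n_rephrasings):
        for _ in range(n_actions):
            action = sample_action(phrasing)
            candidates.append((verifier_score(instruction, action), action))
    return max(candidates, key=lambda c: c[0])[1]

action = best_of_joint("pick up the red block")
print(len(action))  # prints 3: one selected action vector out of 16 candidates
```

The key point this sketch captures is that candidate diversity comes from two multiplied axes (4 rephrasings x 4 action samples = 16 candidates), which the paper reports is often more sample-efficient than scaling either axis alone.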
Computer Science > Robotics
arXiv:2602.12281 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 18 Feb 2026 (this version, v2)]
Title: Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
Abstract: The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling laws for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then i...