[2506.10085] VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.10085 (cs)

[Submitted on 11 Jun 2025 (v1), last revised 27 Feb 2026 (this version, v5)]

Title: VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models

Authors: Christos Ziakas, Alessandra Russo

Abstract: Vision-Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furt...
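To make the core mechanism concrete, the following is a minimal, illustrative sketch of the sequential test-time adaptation loop the abstract describes: a lightweight module on top of frozen features receives one gradient step per observation, so trajectory history accumulates in its parameters. All names here are hypothetical, and the loss is a simple temporal-consistency surrogate standing in for the paper's meta-learned self-supervised loss; this is not the authors' implementation.

```python
import numpy as np

def adapt_over_trajectory(features, lr=0.1, margin=0.5):
    """Sequentially adapt a lightweight linear value head over a trajectory.

    `features` stands in for frozen VLM embeddings, one row per frame.
    The surrogate self-supervised loss encourages the predicted value to
    increase by at least `margin` along the trajectory (illustrative only;
    the actual loss in VITA is meta-learned).
    """
    w = np.zeros(features.shape[1])  # lightweight adaptation module
    prev_value = 0.0
    values = []
    for feat in features:            # one test-time gradient step per frame
        value = float(feat @ w)
        residual = prev_value + margin - value
        if residual > 0:
            # gradient of residual**2 w.r.t. w is -2 * residual * feat,
            # so gradient descent adds 2 * lr * residual * feat
            w += lr * 2.0 * residual * feat
        value = float(feat @ w)      # re-estimate after the update
        values.append(value)
        prev_value = value           # history carried in w and prev_value
    return values

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))      # mock frozen features for 5 frames
vals = adapt_over_trajectory(feats)
print(len(vals))                     # 5
```

Because the head is updated in place at each step, later value estimates depend on everything seen earlier in the trajectory, which is the property the abstract invokes to address temporal reasoning with an otherwise frozen encoder.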