[2507.13231] VITA: Vision-to-Action Flow Matching Policy
Computer Science > Computer Vision and Pattern Recognition

arXiv:2507.13231 (cs)

[Submitted on 17 Jul 2025 (v1), last revised 3 Mar 2026 (this version, v4)]

Title: VITA: Vision-to-Action Flow Matching Policy

Authors: Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani

Abstract: Conventional flow matching and diffusion-based policies sample via iterative denoising from standard noise distributions (e.g., Gaussian) and require conditioning modules to repeatedly inject visual information during the generative process, incurring substantial time and memory overhead. To reduce this complexity, we develop VITA (VIsion-To-Action policy), a noise-free and conditioning-free flow matching policy learning framework that flows directly from visual representations to latent actions. Because the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. Bridging vision and action is challenging, however: actions are lower-dimensional, less structured, and sparser than visual representations, and flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flo...
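The core idea in the abstract — a conditioning-free flow whose source is the visual latent rather than Gaussian noise — can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the network architecture, latent dimension `D`, and the names `VelocityField`, `flow_matching_loss`, and `generate` are all hypothetical; the action autoencoder that produces aligned latents is assumed to exist upstream and is not shown.

```python
import torch
import torch.nn as nn

D = 32  # assumed shared latent dimension; flow matching requires
        # source (visual latent) and target (action latent) to match in size


class VelocityField(nn.Module):
    """Hypothetical velocity network v(x_t, t) for the flow z_vis -> z_act."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t, t):
        # No separate visual-conditioning input: the visual information
        # enters only through the source point of the flow.
        return self.net(torch.cat([x_t, t], dim=-1))


def flow_matching_loss(v_field, z_vis, z_act):
    """Regress the velocity of a straight-line path from z_vis to z_act."""
    t = torch.rand(z_vis.shape[0], 1)
    x_t = (1 - t) * z_vis + t * z_act  # linear interpolation between latents
    target_v = z_act - z_vis           # constant velocity of that path
    return ((v_field(x_t, t) - target_v) ** 2).mean()


@torch.no_grad()
def generate(v_field, z_vis, steps=10):
    """Euler-integrate the learned flow from a visual latent to an action latent."""
    x, dt = z_vis.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * v_field(x, t)
    return x


v = VelocityField(D)
z_vis, z_act = torch.randn(4, D), torch.randn(4, D)
loss = flow_matching_loss(v, z_vis, z_act)
action_latent = generate(v, z_vis)
```

At generation time no noise is sampled and no conditioning module is invoked: integration simply starts at the encoded observation, which is the property the abstract highlights as the source of VITA's time and memory savings.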