[2603.29844] DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
Computer Science > Robotics

arXiv:2603.29844 (cs)

[Submitted on 31 Mar 2026]

Title: DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Authors: Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu

Abstract: The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training...
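As a reading aid, here is a minimal PyTorch sketch of the decoupling the abstract describes: a System-2 module predicts a latent intent (visual foresight) in the VLM's feature space, and a lightweight System-1 policy decodes that intent together with the current observation into an action via latent inverse dynamics. All module names, dimensions, and layer choices below (feat_dim, intent_dim, action_dim, the MLP heads) are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class System2LatentWorldModel(nn.Module):
    """Stand-in for the VLM-based System-2: maps fused vision-language
    features to a predicted future latent (the intent bottleneck).
    Dimensions are placeholders, not the paper's values."""
    def __init__(self, feat_dim=512, intent_dim=256):
        super().__init__()
        self.foresight = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, intent_dim),  # latent visual foresight
        )

    def forward(self, vlm_features):
        return self.foresight(vlm_features)

class System1Policy(nn.Module):
    """Lightweight System-1: latent inverse dynamics from
    (current-observation latent, predicted intent) to a low-level action."""
    def __init__(self, obs_dim=512, intent_dim=256, action_dim=7):
        super().__init__()
        self.inverse_dynamics = nn.Sequential(
            nn.Linear(obs_dim + intent_dim, 256), nn.GELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_latent, intent):
        return self.inverse_dynamics(torch.cat([obs_latent, intent], dim=-1))

# Differentiable end-to-end pass: gradients from an action loss can flow
# through the intent bottleneck back into System-2, which is what makes the
# bottleneck "differentiable" in the abstract's sense.
vlm_features = torch.randn(4, 512)  # placeholder for VLM-encoded (image, instruction)
obs_latent = torch.randn(4, 512)    # placeholder for the current-observation latent
intent = System2LatentWorldModel()(vlm_features)
action = System1Policy()(obs_latent, intent)
print(action.shape)  # torch.Size([4, 7])

Note that in this sketch the intent vector is the only pathway from System-2 to System-1, which is one way to read the abstract's claim that the foresight "serves as the structural bottleneck" between decision making and motor execution.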