[2603.26741] Language-Conditioned World Modeling for Visual Navigation
arXiv:2603.26741 (cs) [Submitted on 23 Mar 2026]

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Title: Language-Conditioned World Modeling for Visual Navigation

Authors: Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng

Abstract: We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts...
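The abstract only names the components of the first model family; as a rough structural illustration of what "an actor-critic agent trained in the latent space of the world model," conditioned on a language instruction, might look like, here is a minimal PyTorch sketch. All names, dimensions, and interfaces below (LatentActorCritic, LATENT_DIM, TEXT_DIM, ACTION_DIM) are hypothetical assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the paper does not specify these.
LATENT_DIM = 256   # assumed size of the world model's latent state
TEXT_DIM = 512     # assumed size of the instruction embedding
ACTION_DIM = 2     # assumed continuous action, e.g. (linear, angular) velocity

class LatentActorCritic(nn.Module):
    """Sketch of an actor-critic that acts on a world-model latent
    concatenated with a language-instruction embedding."""

    def __init__(self):
        super().__init__()
        in_dim = LATENT_DIM + TEXT_DIM
        # Actor head: maps (latent, instruction) to a continuous action.
        self.actor = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, ACTION_DIM)
        )
        # Critic head: estimates the value of the same conditioned state.
        self.critic = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, latent, text_emb):
        x = torch.cat([latent, text_emb], dim=-1)
        return self.actor(x), self.critic(x)

# Usage: a world model (not shown) would encode the initial egocentric
# observation into `latent`; random tensors stand in for both inputs here.
latent = torch.randn(1, LATENT_DIM)
text_emb = torch.randn(1, TEXT_DIM)
policy = LatentActorCritic()
action, value = policy(latent, text_emb)
print(action.shape, value.shape)  # torch.Size([1, 2]) torch.Size([1, 1])
```

In the paper's setup the latent would come from LCVN-WM, the diffusion-based world model, and the policy would be rolled out open-loop against the world model's predicted future states; this sketch shows only the conditioning structure, not that training loop.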