[2512.03454] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
Computer Science > Computer Vision and Pattern Recognition

arXiv:2512.03454 (cs)

[Submitted on 3 Dec 2025 (v1), last revised 24 Mar 2026 (this version, v3)]

Title: Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Authors: Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li

Abstract: Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for ...
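The abstract describes SA-WM's core loop: distill the scene and command into a command-aware latent state, then roll out a short sequence of future latent states for the decoder to consume. The following is a minimal PyTorch sketch of that roll-out pattern only, not the paper's implementation; the module names, dimensions, fusion MLP, and GRUCell transition are all assumptions for illustration.

# Minimal sketch (not the authors' code) of the latent roll-out idea behind
# SA-WM: fuse scene features with a command embedding into a latent state,
# then predict future latent states for a downstream grounding decoder.
# All names, dimensions, and the GRUCell transition are assumptions.
import torch
import torch.nn as nn


class LatentRollout(nn.Module):
    def __init__(self, scene_dim: int, cmd_dim: int, latent_dim: int, horizon: int):
        super().__init__()
        self.horizon = horizon
        # Distill scene + command into a single command-aware latent state.
        self.fuse = nn.Sequential(
            nn.Linear(scene_dim + cmd_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Learned transition that "imagines" the next latent state.
        self.transition = nn.GRUCell(latent_dim, latent_dim)

    def forward(self, scene_feat: torch.Tensor, cmd_emb: torch.Tensor) -> torch.Tensor:
        # scene_feat: (B, scene_dim), cmd_emb: (B, cmd_dim)
        state = self.fuse(torch.cat([scene_feat, cmd_emb], dim=-1))
        states = [state]
        for _ in range(self.horizon):
            # Each step conditions on the previously imagined state.
            state = self.transition(state, state)
            states.append(state)
        # (B, horizon + 1, latent_dim): current + future latent states,
        # to be fused with the multimodal input by a downstream decoder.
        return torch.stack(states, dim=1)


if __name__ == "__main__":
    model = LatentRollout(scene_dim=256, cmd_dim=128, latent_dim=256, horizon=4)
    rollout = model(torch.randn(2, 256), torch.randn(2, 128))
    print(rollout.shape)  # torch.Size([2, 5, 256])

In this reading, the "think before you drive" step is simply that grounding is decided against the whole stacked trajectory of latent states rather than the current frame alone; how ThinkDeeper's hypergraph-guided decoder actually fuses them is not specified in the abstract.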