[2604.08719] LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
Computer Science > Computer Vision and Pattern Recognition

arXiv:2604.08719 (cs)

[Submitted on 9 Apr 2026]

Title: LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

Authors: Hao Shao, Letian Wang, Yang Zhou, Yuxuan Hu, Zhuofan Zong, Steven L. Waslander, Wei Zhan, Hongsheng Li

Abstract: Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides c...
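
The abstract only states the framework's input/output behavior: multi-view camera frames and a natural-language instruction go in, a predicted future driving video and control signals come out. The sketch below is a minimal, hypothetical illustration of that contract only; every class, field, and shape here is an assumption for illustration, not the paper's API or architecture.

```python
"""Hypothetical sketch of the I/O contract described in the abstract.

All names and tensor shapes are illustrative assumptions; the paper's
actual model and interfaces are not specified in the abstract.
"""
from dataclasses import dataclass
import numpy as np


@dataclass
class DrivingObservation:
    # (num_views, height, width, 3) uint8 frames from the surround cameras
    multi_view_images: np.ndarray
    # free-form driving instruction, e.g. "turn left at the next intersection"
    instruction: str


@dataclass
class DrivingPrediction:
    # (num_future_steps, num_views, height, width, 3) imagined future frames
    future_video: np.ndarray
    # (num_future_steps, 3) control signals, e.g. steering, throttle, brake
    controls: np.ndarray


class UnifiedDrivingModel:
    """Placeholder standing in for a unified understanding + world model."""

    def __init__(self, num_future_steps: int = 8):
        self.num_future_steps = num_future_steps

    def predict(self, obs: DrivingObservation) -> DrivingPrediction:
        v, h, w, c = obs.multi_view_images.shape
        # A real model would condition on the images and the instruction;
        # here we only return correctly shaped zero arrays to show the contract.
        future_video = np.zeros((self.num_future_steps, v, h, w, c), dtype=np.uint8)
        controls = np.zeros((self.num_future_steps, 3), dtype=np.float32)
        return DrivingPrediction(future_video, controls)


if __name__ == "__main__":
    obs = DrivingObservation(
        multi_view_images=np.zeros((6, 224, 224, 3), dtype=np.uint8),
        instruction="follow the lane and stop at the red light",
    )
    pred = UnifiedDrivingModel().predict(obs)
    print(pred.future_video.shape, pred.controls.shape)
```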