[2510.00060] Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.00060 (cs)
[Submitted on 29 Sep 2025 (v1), last revised 27 Feb 2026 (this version, v3)]

Title: Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving
Authors: Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang

Abstract: In this work, we reconceptualize autonomous driving as a generalized language problem and formulate the trajectory planning task as next waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving, named in tribute to the renowned Dutch racing driver Max Verstappen. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of the Vision-Language Model (VLM) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework highly amenable to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overa...
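The abstract's core idea, casting trajectory planning as next-waypoint prediction, can be illustrated with a minimal sketch. All names here (`WaypointDecoder`, `predict_next`, `decode_trajectory`) are hypothetical, not from the paper, and the toy model simply extrapolates a straight path; a real system would condition each step on VLM features of the front-view camera image.

```python
# Hypothetical sketch: trajectory planning as autoregressive
# next-waypoint prediction, in the spirit of the abstract.
from dataclasses import dataclass
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in the ego frame, metres


@dataclass
class WaypointDecoder:
    """Toy stand-in for a VLM head that emits one waypoint per step."""
    step: float = 2.0  # assumed constant forward progress per step

    def predict_next(self, image_feat: List[float],
                     history: List[Waypoint]) -> Waypoint:
        # A real model would condition on camera features; here we
        # just extend a straight path to illustrate the decoding loop.
        x, y = history[-1] if history else (0.0, 0.0)
        return (x + self.step, y)


def decode_trajectory(model: WaypointDecoder, image_feat: List[float],
                      horizon: int) -> List[Waypoint]:
    """Single-pass generation: each waypoint plays the role of the
    'next token', conditioned on the image and all prior waypoints."""
    traj: List[Waypoint] = []
    for _ in range(horizon):
        traj.append(model.predict_next(image_feat, traj))
    return traj


traj = decode_trajectory(WaypointDecoder(), image_feat=[0.0], horizon=3)
print(traj)  # [(2.0, 0.0), (4.0, 0.0), (6.0, 0.0)]
```

The loop mirrors language-model decoding: the supervision signal the paper describes would score each predicted waypoint against the expert demonstration, exactly as a next-token loss scores each generated token.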