[2603.19979] X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.19979 (cs) [Submitted on 20 Mar 2026]

Title: X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving

Authors: Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, Yu Zhang, Xianming Liu

Abstract: Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision-language-action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X-World further supports option...
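The abstract describes an interface in which the model consumes synchronized multi-view camera history plus a future action sequence and autoregressively emits future multi-camera frames. The sketch below illustrates only that interface shape, not X-World's actual method: the class name, tensor layout `(T, cameras, H, W, 3)`, and the placeholder dynamics (perturbing the last frame) are all assumptions for illustration.

```python
import numpy as np


class ActionConditionedWorldModel:
    """Hypothetical interface for an action-conditioned multi-camera
    world model, in the spirit of the abstract above. The dynamics here
    are a stand-in; a real model would run a generative video backbone
    conditioned on the commanded action."""

    def __init__(self, num_cameras: int, height: int, width: int, seed: int = 0):
        self.num_cameras = num_cameras
        self.height = height
        self.width = width
        self.rng = np.random.default_rng(seed)

    def step(self, history: np.ndarray, action: np.ndarray) -> np.ndarray:
        # history: (T, num_cameras, H, W, 3) in [0, 1]; action: (action_dim,)
        # Placeholder: perturb the most recent multi-camera frame.
        last = history[-1]
        noise = 0.01 * self.rng.standard_normal(last.shape)
        return np.clip(last + noise, 0.0, 1.0)

    def rollout(self, history: np.ndarray, actions: np.ndarray) -> np.ndarray:
        # Autoregressive rollout: each generated multi-camera frame is
        # appended to the history before predicting the next one, so the
        # simulator stays conditioned on its own outputs over the horizon.
        frames = []
        buf = history
        for action in actions:
            nxt = self.step(buf, action)
            frames.append(nxt)
            buf = np.concatenate([buf, nxt[None]], axis=0)
        return np.stack(frames)  # (len(actions), num_cameras, H, W, 3)
```

Usage under the same assumptions: `rollout` on a history of shape `(T, 6, H, W, 3)` with `K` actions returns `K` future multi-camera frames, one per commanded action.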