[2511.12882] Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos
Computer Science > Robotics
arXiv:2511.12882 (cs)
[Submitted on 17 Nov 2025 (v1), last revised 31 Mar 2026 (this version, v3)]

Title: Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos
Authors: Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Jianjun Zhang, Zitai Huang, Hanli Wang, Yi Xu

Abstract: Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low-level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real-world physical interactions. To address these limitations, we propose MTV-World, an embodied world model that introduces Multi-view Trajectory-Video control for precise visuomotor prediction. Specifically, instead of directly using low-level actions for control, we employ trajectory videos, obtained through camera intrinsic and extrinsic parameters and Cartesian-space transformation, as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we introduce a multi-view framework that compensates for spatial information loss and ensures high consistency w...
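The control signal the abstract describes is obtained by projecting a Cartesian-space end-effector trajectory into each camera's image plane using its intrinsic and extrinsic parameters. The sketch below shows that standard pinhole projection step for multiple views; it is a minimal illustration under common conventions (world-to-camera extrinsics R, t and a shared intrinsic matrix K), and the function name, matrix values, and example trajectory are hypothetical, not taken from the paper.

```python
import numpy as np

def project_trajectory(points_3d, K, R, t):
    """Project an (N, 3) Cartesian-space trajectory into (N, 2) pixel
    coordinates for one camera, given intrinsics K (3x3) and
    world-to-camera extrinsics R (3x3), t (3,)."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    pix = cam @ K.T                    # camera frame -> homogeneous pixels
    return pix[:, :2] / pix[:, 2:3]    # perspective divide

# Hypothetical setup: a straight-line end-effector motion observed by two views.
trajectory = np.linspace([0.3, 0.0, 0.5], [0.5, 0.2, 0.4], num=50)  # (50, 3), meters
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])  # shared pinhole intrinsics (illustrative)
views = [
    (np.eye(3), np.array([0.0, 0.0, 1.0])),                    # front camera
    (np.array([[0.0, 0.0, -1.0],
               [0.0, 1.0,  0.0],
               [1.0, 0.0,  0.0]]), np.array([0.5, 0.0, 1.0])), # side camera
]
multi_view_tracks = [project_trajectory(trajectory, K, R, t) for R, t in views]
```

Rendering each projected track as a 2D curve over the video frames yields one trajectory video per camera; using two or more such views is what compensates for the depth information that any single 2D projection discards, as the abstract notes.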