[2506.21458] MindCube: Spatial Mental Modeling from Limited Views
Computer Science > Artificial Intelligence

arXiv:2506.21458 (cs)

[Submitted on 26 Jun 2025 (v1), last revised 31 Mar 2026 (this version, v2)]

Title: MindCube: Spatial Mental Modeling from Limited Views

Authors: Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, Manling Li

Abstract: Can Vision-Language Models (VLMs) imagine the full scene from just a few views, as humans do? Humans naturally form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our MindCube benchmark, with 21,154 questions across 3,268 images, exposes this critical gap: existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation of "what-if" movements). We then explore three approaches to approximating spatial mental models in VLMs: incorporating unseen intermediate views, natural language reasoning chains, and cognitive maps. The most significant improvement comes from a synergistic "map-then-reason" approach that jointly trains the model to first generate a cognitive map and then reason over it for question answering.
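As a rough illustration of the two-stage "map-then-reason" flow described in the abstract, the sketch below separates map generation from map-conditioned answering at inference time. This is a minimal sketch, not the paper's implementation: `query_vlm`, the prompt wording, and the file names are hypothetical placeholders for whatever VLM client and data the reader has.

```python
# Minimal sketch of a "map-then-reason" inference loop, assuming a generic
# VLM client. None of these names come from the MindCube paper or codebase.

def query_vlm(images: list[str], prompt: str) -> str:
    """Hypothetical stand-in for a real VLM inference call."""
    return "<model output>"  # replace with an actual model client


def map_then_reason(images: list[str], question: str) -> str:
    # Stage 1: elicit an explicit cognitive map of the scene, e.g. a
    # listing of objects and their positions relative to one another.
    map_prompt = (
        "From these views, write a cognitive map of the scene: list each "
        "object and its position relative to the others."
    )
    cognitive_map = query_vlm(images, map_prompt)

    # Stage 2: answer the spatial question conditioned on the generated
    # map rather than on the raw views alone.
    reason_prompt = (
        f"Cognitive map:\n{cognitive_map}\n\n"
        f"Using this map, answer the question: {question}"
    )
    return query_vlm(images, reason_prompt)


if __name__ == "__main__":
    views = ["view_front.jpg", "view_left.jpg"]  # hypothetical inputs
    print(map_then_reason(views, "If I turn left, what is behind me?"))
```

The point of the split is that the second call reasons over an explicit, externalized spatial representation instead of answering directly from pixels; the paper additionally trains the model jointly on both stages, which this inference-only sketch does not show.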