[2603.26639] Make Geometry Matter for Spatial Reasoning
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.26639 (cs)
[Submitted on 27 Mar 2026]
Title: Make Geometry Matter for Spatial Reasoning
Authors: Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang
Abstract: Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, t...
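The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how tokens are chosen for masking or how the gate is parameterized, so the sketch assumes uniform random masking (in place of the paper's "strategic" masking) and a per-token sigmoid gate computed from the concatenated vision and geometry tokens. The names `mask_vision_tokens`, `gated_fusion`, `w_gate`, and `b_gate` are hypothetical.

```python
import numpy as np


def mask_vision_tokens(vision, mask_ratio, rng):
    """Zero out a random subset of 2D vision tokens.

    Assumption: uniform random selection stands in for the paper's
    unspecified "strategic" masking.
    vision: array of shape (num_tokens, dim).
    Returns the masked tokens and a boolean keep-mask of shape (num_tokens,).
    """
    num_tokens = vision.shape[0]
    num_masked = int(round(mask_ratio * num_tokens))
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    keep = np.ones(num_tokens, dtype=bool)
    keep[masked_idx] = False
    return vision * keep[:, None], keep


def gated_fusion(vision, geometry, w_gate, b_gate):
    """Fuse geometry tokens into vision tokens through a per-token gate.

    Assumption: the gate is a sigmoid of a linear map over the concatenated
    tokens; the paper's actual routing mechanism may differ.
    vision, geometry: arrays of shape (num_tokens, dim).
    w_gate: array of shape (2 * dim, 1); b_gate: scalar bias.
    """
    features = np.concatenate([vision, geometry], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(features @ w_gate + b_gate)))  # (num_tokens, 1)
    return vision + gate * geometry, gate


# Usage with toy data: 8 tokens of dimension 16.
rng = np.random.default_rng(0)
vision = rng.normal(size=(8, 16))
geometry = rng.normal(size=(8, 16))
masked_vision, keep = mask_vision_tokens(vision, mask_ratio=0.5, rng=rng)
w_gate = rng.normal(size=(32, 1)) * 0.1
fused, gate = gated_fusion(masked_vision, geometry, w_gate, b_gate=0.0)
```

In this sketch, masked positions contribute nothing from the 2D stream, so the fused token there is just the gated geometry token. That mirrors, in a simplified way, the stated goal of pushing the model toward geometric evidence during training.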