[2603.26690] SpatialPoint: Spatial-aware Point Prediction for Embodied Localization
Computer Science > Robotics
arXiv:2603.26690 (cs)
[Submitted on 16 Mar 2026]
Title: SpatialPoint: Spatial-aware Point Prediction for Embodied Localization
Authors: Qiming Zhu, Zhirui Fang, Tianming Zhang, Chuanxiu Liu, Xiaoke Jiang, Lei Zhang
Abstract: Embodied intelligence fundamentally requires the capability to determine where to act in 3D space. We formalize this requirement as embodied localization -- the problem of predicting executable 3D points conditioned on visual observations and language instructions. We instantiate embodied localization with two complementary target types: touchable points, surface-grounded 3D points that enable direct physical interaction, and air points, free-space 3D points that specify placement and navigation goals, directional constraints, or geometric relations. Embodied localization is inherently a problem of embodied 3D spatial reasoning -- yet most existing vision-language systems rely predominantly on RGB inputs, forcing implicit geometric reconstruction that limits cross-scene generalization, despite the widespread adoption of RGB-D sensors in robotics. To address this gap, we propose SpatialPoint, a carefully designed spatial-aware vision-language framework that integrates structured depth into a vision-language model (VLM) and generates camera-frame 3D coordinates. We c...
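
The abstract does not specify how the camera-frame 3D coordinates relate to the depth input; for orientation, a standard pinhole back-projection from a predicted pixel location and its metric depth is sketched below. All names (backproject, fx, fy, cx, cy) and the example intrinsics are illustrative assumptions, not the paper's actual formulation.

    import numpy as np

    def backproject(u, v, depth, fx, fy, cx, cy):
        # Standard pinhole back-projection of pixel (u, v) with metric depth
        # into camera-frame XYZ; not necessarily SpatialPoint's exact scheme.
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        return np.array([x, y, depth])

    # Example: a touchable point predicted at pixel (320, 240) with 0.85 m depth,
    # using illustrative intrinsics for a 640x480 RGB-D sensor.
    point_cam = backproject(320.0, 240.0, 0.85, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    print(point_cam)  # -> [0.   0.   0.85]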