[2601.22228] Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
Computer Science > Computer Vision and Pattern Recognition

arXiv:2601.22228 (cs)

[Submitted on 29 Jan 2026 (v1), last revised 29 Apr 2026 (this version, v2)]

Title: Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
Authors: Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser

Abstract: We study whether vision-language models (VLMs) can solve relative camera pose estimation (RCPE) from image pairs, a direct test of multi-view spatial reasoning. We cast RCPE as a discrete verbal classification task and introduce \texttt{VRRPI-Bench}, built from real RGB-D frames with object-centric camera motion, and \texttt{VRRPI-Diag}, which isolates individual motion degrees of freedom. Humans (0.91 accuracy) and specialized geometric pipelines such as LoFTR (0.99) solve the task reliably, yet the best VLM reaches only 0.66 and most others remain near random. Our analyses show that this gap does not stem from a lack of basic spatial competence: strong VLMs are near ceiling on single-image benchmarks, yet fall to near-random accuracy once reasoning must span views. They are also unstable under source-target reversal (59.7\% consistency for the best model) and remain weak even in simplified single-DoF settings, especially on optical-axis motions such as roll and depth translation (GPT-5: 0.46 on roll). These failures are useful: they...
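The abstract does not spell out how a relative pose is turned into a discrete verbal class, but the underlying two-view geometry is standard. The sketch below is a minimal illustration, not the paper's actual protocol: it assumes world-to-camera extrinsics (x_cam = R @ x_world + t), and the function names (relative_pose, verbal_labels), thresholds, and class vocabulary are all hypothetical. It also demonstrates the two diagnostics the abstract mentions: single-DoF motions (here, pure roll about the optical axis) and source-target reversal, under which the relative pose inverts and a consistent predictor must flip its answer.

import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Pose taking camera-1 coordinates to camera-2 coordinates.

    Assumes world-to-camera extrinsics, x_cam = R @ x_world + t, so
    R_rel = R2 @ R1.T and t_rel = t2 - R_rel @ t1. Swapping the two
    cameras yields the inverse: (R_rel.T, -R_rel.T @ t_rel).
    """
    R_rel = R2 @ R1.T
    t_rel = t2 - R_rel @ t1
    return R_rel, t_rel

def verbal_labels(R_rel, t_rel, rot_thresh_deg=5.0, trans_thresh=0.05):
    """Bin a relative pose into coarse verbal motion classes.

    A hypothetical discretization: thresholds and class names are
    illustrative only. Axes follow the usual camera convention
    (x right, y down, z forward along the optical axis); exact
    sign conventions for pan/tilt/roll vary between datasets.
    """
    labels = []

    # Rotation: axis-angle form; the dominant axis selects pan/tilt/roll.
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    if angle_deg > rot_thresh_deg:
        # Rotation axis from the skew-symmetric part of R_rel
        # (degenerates near 180 degrees; fine for a sketch).
        axis = np.array([R_rel[2, 1] - R_rel[1, 2],
                         R_rel[0, 2] - R_rel[2, 0],
                         R_rel[1, 0] - R_rel[0, 1]])
        axis /= np.linalg.norm(axis)
        names = [("tilt down", "tilt up"),   # x-axis (pitch)
                 ("pan right", "pan left"),  # y-axis (yaw)
                 ("roll cw", "roll ccw")]    # z-axis: the optical-axis case
        k = int(np.argmax(np.abs(axis)))
        labels.append(names[k][0] if axis[k] > 0 else names[k][1])

    # Translation: camera-2 center expressed in camera-1's frame,
    # i.e. d = R1 @ (C2 - C1) = -R_rel.T @ t_rel.
    d = -R_rel.T @ t_rel
    dirs = [("move right", "move left"),
            ("move down", "move up"),
            ("move forward", "move backward")]
    k = int(np.argmax(np.abs(d)))
    if abs(d[k]) > trans_thresh:
        labels.append(dirs[k][0] if d[k] > 0 else dirs[k][1])

    return labels or ["static"]

if __name__ == "__main__":
    # Single-DoF diagnostic: pure roll about the optical axis, the
    # motion the abstract reports VLMs handle worst.
    th = np.radians(20.0)
    R1, t1 = np.eye(3), np.zeros(3)
    R2 = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0,         0.0,        1.0]])
    t2 = np.zeros(3)
    print(verbal_labels(*relative_pose(R1, t1, R2, t2)))  # a single roll label

    # Source-target reversal: the reversed pair gives the inverse pose,
    # so the direction labels should flip for a consistent predictor.
    R_fwd, t_fwd = relative_pose(R1, t1, R2, t2)
    R_bwd, t_bwd = relative_pose(R2, t2, R1, t1)
    assert np.allclose(R_bwd, R_fwd.T) and np.allclose(t_bwd, -R_fwd.T @ t_fwd)

The reversal identity in the final assert is what makes the paper's consistency metric well defined: the ground-truth answer for the swapped image pair is fully determined by the answer for the original pair, so any disagreement is attributable to the model rather than to the task.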