LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
Summary
The paper introduces LRR-Bench, a benchmark for evaluating Vision-Language Models (VLMs) on spatial understanding tasks, revealing significant performance gaps compared to human capabilities.
Why It Matters
As VLMs are increasingly integrated into applications such as autonomous driving and robotics, understanding their limitations in spatial reasoning is crucial for improving their design and functionality. The findings underscore that reliable spatial perception remains a prerequisite for deploying these models in real-world systems.
Key Takeaways
- Introduces a new benchmark for assessing VLMs' spatial understanding.
- Identifies significant performance gaps between VLMs and human capabilities.
- Categorizes spatial understanding into absolute and 3D types.
- Utilizes a synthetic dataset to enable low-cost testing.
- Highlights the need for improvements in VLMs for practical applications.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2507.20174 (cs)
[Submitted on 27 Jul 2025 (v1), last revised 22 Feb 2026 (this version, v3)]
Title: LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
Authors: Fei Kong, Jinhao Duan, Kaidi Xu, Zhenhua Guo, Xiaofeng Zhu, Xiaoshuang Shi
Abstract: Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Explicitly, ...
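The absolute-spatial setting described in the abstract can be illustrated with a minimal sketch: place an object on a randomly chosen half of a synthetic canvas and record the ground-truth side as the answer. The function name and sample layout below are hypothetical, intended only to show why such samples are cheap to generate and contamination-free; they are not the authors' actual pipeline.

```python
import random

def make_absolute_sample(width=640, height=480, rng=None):
    """Generate one hypothetical absolute-position test sample.

    The object's centre is sampled from the left or right half of
    the canvas (with a small margin from the edges), and the chosen
    side becomes the ground-truth answer to a left/right question.
    """
    rng = rng or random.Random()
    side = rng.choice(["left", "right"])
    if side == "left":
        cx = rng.randint(20, width // 2 - 20)
    else:
        cx = rng.randint(width // 2 + 20, width - 20)
    cy = rng.randint(20, height - 20)
    return {
        "object_center": (cx, cy),
        "question": "Is the object on the left or the right side of the image?",
        "answer": side,
    }

sample = make_absolute_sample(rng=random.Random(0))
```

Because both the image content and the label come from the same sampling step, the ground truth is exact by construction, and arbitrarily many fresh samples can be drawn, which is what makes a fully synthetic benchmark both low-cost and resistant to training-set contamination.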