[2602.15950] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
Summary
This paper tests how well vision-language models (VLMs) localize filled cells in 15x15 binary grids. It finds that three frontier models (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) transcribe the grids far less accurately when the cells are drawn as filled squares than when they are drawn as text symbols.
Why It Matters
Understanding the limitations of VLMs is crucial for advancing AI technologies that rely on visual and textual integration. This study highlights significant weaknesses in spatial reasoning capabilities, which could impact applications in computer vision and AI-driven analysis.
Key Takeaways
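- VLMs lose accuracy when filled cells lack textual identity: cell accuracy falls from roughly 84-91% with text symbols to 60-73% with filled squares, and F1 falls from 63-84% to 29-39%.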
- Performance varies significantly across models, with distinct failure modes identified.
- The study underscores the importance of text-recognition pathways in spatial reasoning tasks.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.15950 (cs), submitted on 17 Feb 2026
Authors: Yuval Levental
Abstract
We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types -- text symbols (. and #) and filled squares without gridlines -- then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder -- the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their nat...
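To make the reported metrics concrete, here is a minimal sketch of how a 15x15 binary grid could be generated at a target density, written out with the `.` and `#` symbols, and scored against a model transcription. The function names (`make_grid`, `to_text`, `parse_text`, `score`), the random sampling scheme, and the choice of filled cells as the positive class for F1 are assumptions made for illustration; the visible abstract does not specify the paper's exact generation or scoring procedure.

```python
import random

SIZE = 15  # grid side length stated in the abstract


def make_grid(density, size=SIZE, seed=0):
    """Return a size x size grid of 0/1 with round(density * size**2) filled cells.

    Uniform random placement is an assumption, not the paper's stated method.
    """
    rng = random.Random(seed)
    n_cells = size * size
    n_filled = round(density * n_cells)
    filled = set(rng.sample(range(n_cells), n_filled))
    return [[1 if r * size + c in filled else 0 for c in range(size)]
            for r in range(size)]


def to_text(grid):
    """Write a grid as rows of '.' (empty) and '#' (filled)."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


def parse_text(text, size=SIZE):
    """Parse a transcription; missing or unrecognized characters count as empty."""
    rows = text.strip().splitlines()
    grid = []
    for r in range(size):
        line = rows[r] if r < len(rows) else ""
        grid.append([1 if c < len(line) and line[c] == "#" else 0
                     for c in range(size)])
    return grid


def score(truth, pred):
    """Return (cell accuracy, F1), treating filled cells as the positive class."""
    tp = fp = fn = tn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


if __name__ == "__main__":
    truth = make_grid(0.107)
    print(to_text(truth))
    # A perfect transcription scores 1.0 on both metrics.
    print(score(truth, parse_text(to_text(truth))))
```

Under this definition, a transcription that leaves every cell empty still earns high cell accuracy on a sparse grid (201 of 225 cells correct at 24 filled cells) while its F1 is 0, which is one reason to report both numbers. The abstract's density range of 10.7% to 41.8% is consistent with 24 and 94 filled cells out of 225.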