[2602.15950] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families

[2602.15950] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families

arXiv - Machine Learning 4 min read Article

Summary

This article investigates the limitations of vision-language models (VLMs) in spatial reasoning, particularly their struggle to localize non-textual visual elements in binary grids, revealing critical performance gaps across three leading models.

Why It Matters

Understanding the limitations of VLMs is crucial for advancing AI technologies that rely on visual and textual integration. This study highlights significant weaknesses in spatial reasoning capabilities, which could impact applications in computer vision and AI-driven analysis.

Key Takeaways

  • VLMs show a marked decline in accuracy when localizing non-textual elements.
  • Performance varies significantly across models, with distinct failure modes identified.
  • The study underscores the importance of text-recognition pathways in spatial reasoning tasks.

Computer Science > Computer Vision and Pattern Recognition arXiv:2602.15950 (cs) [Submitted on 17 Feb 2026] Title:Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families Authors:Yuval Levental View a PDF of the paper titled Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families, by Yuval Levental View PDF Abstract:We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types -- text symbols (. and #) and filled squares without gridlines -- then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder -- the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their nat...

Related Articles

Llms

[D] Howcome Muon is only being used for Transformers?

Muon has quickly been adopted in LLM training, yet we don't see it being talked about in other contexts. Searches for Muon on ConvNets tu...

Reddit - Machine Learning · 1 min ·
Llms

[P] I trained a language model from scratch for a low resource language and got it running fully on-device on Android (no GPU, demo)

Hi Everybody! I just wanted to share an update on a project I’ve been working on called BULaMU, a family of language models trained (20M,...

Reddit - Machine Learning · 1 min ·
Paper Finds That Leading AI Chatbots Like ChatGPT and Claude Remain Incredibly Sycophantic, Resulting in Twisted Effects on Users
Llms

Paper Finds That Leading AI Chatbots Like ChatGPT and Claude Remain Incredibly Sycophantic, Resulting in Twisted Effects on Users

A study found that sycophancy is pervasive among chatbots, and that bots are more likely than human peers to affirm a person's bad behavior.

AI Tools & Products · 6 min ·
Popular AI gateway startup LiteLLM ditches controversial startup Delve | TechCrunch
Llms

Popular AI gateway startup LiteLLM ditches controversial startup Delve | TechCrunch

LiteLLM had obtained two security compliance certifications via Delve and fell victim to some horrific credential-stealing malware last w...

TechCrunch - AI · 3 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime