[2602.15892] Egocentric Bias in Vision-Language Models
Summary
The paper introduces FlipSet, a benchmark for assessing visual perspective taking in vision-language models, revealing a systematic egocentric bias in their predictions.
Why It Matters
Understanding egocentric bias in vision-language models is crucial for improving AI's social cognition and spatial reasoning capabilities. The findings highlight limitations in current models, which could inform future research and development in AI systems that require perspective-taking abilities.
Key Takeaways
- FlipSet benchmark assesses Level-2 visual perspective taking in VLMs.
- Most of the 103 evaluated VLMs perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint.
- Models handle theory-of-mind and mental rotation well in isolation, yet fail when the task requires combining the two.
- Current VLMs lack mechanisms to bind social awareness with spatial operations.
- FlipSet offers a cognitively grounded testbed for diagnosing perspective-taking in multimodal models.
arXiv:2602.15892 (cs) — Computer Science > Computer Vision and Pattern Recognition [Submitted on 10 Feb 2026]
Title: Egocentric Bias in Vision-Language Models
Authors: Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, Hokin Deng, Dezhi Luo
Abstract: Visual perspective taking — inferring how the world appears from another's viewpoint — is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit — models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimo...
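To make the task concrete, here is a minimal sketch of the kind of transformation the abstract describes: an agent facing the camera sees a flat character string rotated 180 degrees, which amounts to reversing the character order and flipping each glyph in place. This is not the paper's implementation, and the glyph mapping below is a hypothetical example restricted to rotation-symmetric characters.

```python
# Hypothetical mapping for glyphs that remain valid characters
# after a 180-degree in-plane rotation (illustrative, not from the paper).
FLIP = {
    "b": "q", "q": "b", "d": "p", "p": "d",
    "n": "u", "u": "n", "6": "9", "9": "6",
    "0": "0", "1": "1", "8": "8", "o": "o",
    "s": "s", "x": "x", "z": "z",
    "H": "H", "I": "I", "N": "N", "O": "O",
    "S": "S", "X": "X", "Z": "Z",
}

def rotate_180(s: str) -> str:
    """Rotate a 2D character string 180 degrees:
    reverse the character order, then flip each glyph."""
    return "".join(FLIP[c] for c in reversed(s))

# An egocentric (camera-view) answer simply repeats the input;
# the correct allocentric answer applies the full rotation.
print(rotate_180("bud"))  # -> "pnq"
```

Note that the rotation is an involution: applying it twice returns the original string, which is one way such a benchmark could sanity-check its ground-truth labels.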