[2505.03821] Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2505.03821 (cs)

[Submitted on 3 May 2025 (v1), last revised 28 Mar 2026 (this version, v2)]

Title: Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Authors: Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński

Abstract: We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a new set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations -- such as object position relative to the minifigure and the minifigure's orientation -- and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. We evaluate several high-performing models, including Gemini Robotics-ER 1.5, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, GPT-4, and Qwen3, and find that while they excel at scene understanding, performance declines markedly on spatial reasoning ...
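The abstract describes a full-factorial benchmark: spatial configurations crossed with two viewpoints yield 144 tasks, each probed with 7 questions. As a minimal sketch of how such a grid could be enumerated, the snippet below crosses hypothetical factors with itertools.product. The factor names and level counts are illustrative assumptions (the abstract does not state the actual factorization of the 144 scenes), chosen only so the product equals 144.

```python
from itertools import product

# Hypothetical factor levels; counts are illustrative assumptions
# picked so that 9 * 8 * 2 = 144, matching the abstract's task count.
OBJECT_POSITIONS = [f"pos_{i}" for i in range(9)]    # object position relative to the minifigure
ORIENTATIONS = [f"deg_{45 * i}" for i in range(8)]   # minifigure facing direction
VIEWS = ["birds_eye", "surface_level"]               # the two viewpoints named in the abstract

# Seven diagnostic questions per task, grouped by the three levels of
# visual cognition from the abstract (question texts are placeholders;
# the split across levels is also a hypothetical assumption).
QUESTION_LEVELS = {
    "scene_understanding": ["q1", "q2"],
    "spatial_reasoning": ["q3", "q4"],
    "visual_perspective_taking": ["q5", "q6", "q7"],
}

# Enumerate every combination of the factors into a task record.
tasks = [
    {"object_position": pos, "orientation": ori, "view": view}
    for pos, ori, view in product(OBJECT_POSITIONS, ORIENTATIONS, VIEWS)
]
assert len(tasks) == 144

total_items = len(tasks) * sum(len(qs) for qs in QUESTION_LEVELS.values())
print(f"{len(tasks)} tasks x 7 questions = {total_items} evaluation items")
```

A balanced crossing like this is what makes the per-level comparison in the abstract meaningful: every question level is asked over the same 144 scenes, so a drop from scene understanding to spatial reasoning reflects the cognitive demand rather than a shift in the underlying stimuli.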