[2505.03821] Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2505.03821 (cs)

[Submitted on 3 May 2025 (v1), last revised 28 Mar 2026 (this version, v2)]

Title: Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Authors: Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński

Abstract: We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a new set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations -- such as object position relative to the minifigure and the minifigure's orientation -- and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. We evaluate several high-performing models, including Gemini Robotics-ER 1.5, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, GPT-4, and Qwen3, and find that while they excel at scene understanding, performance declines markedly on spatial reasoning ...
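The abstract describes a full-factorial benchmark: spatial configurations crossed with two viewpoints yield 144 tasks, each probed with 7 questions. As a minimal sketch of how such a grid could be enumerated, the snippet below crosses hypothetical factors with itertools.product. The factor names and level counts are illustrative assumptions (the abstract does not state the actual factorization of the 144 scenes), chosen only so the product equals 144.

```python
from itertools import product

# Hypothetical factor levels; counts are illustrative assumptions
# picked so that 9 * 8 * 2 = 144, matching the abstract's task count.
OBJECT_POSITIONS = [f"pos_{i}" for i in range(9)]    # object position relative to the minifigure
ORIENTATIONS = [f"deg_{45 * i}" for i in range(8)]   # minifigure facing direction
VIEWS = ["birds_eye", "surface_level"]               # the two viewpoints named in the abstract

# Seven diagnostic questions per task, grouped by the three levels of
# visual cognition from the abstract (question texts are placeholders;
# the split across levels is also a hypothetical assumption).
QUESTION_LEVELS = {
    "scene_understanding": ["q1", "q2"],
    "spatial_reasoning": ["q3", "q4"],
    "visual_perspective_taking": ["q5", "q6", "q7"],
}

# Enumerate every combination of the factors into a task record.
tasks = [
    {"object_position": pos, "orientation": ori, "view": view}
    for pos, ori, view in product(OBJECT_POSITIONS, ORIENTATIONS, VIEWS)
]
assert len(tasks) == 144

total_items = len(tasks) * sum(len(qs) for qs in QUESTION_LEVELS.values())
print(f"{len(tasks)} tasks x 7 questions = {total_items} evaluation items")
```

A balanced crossing like this is what makes the per-level comparison in the abstract meaningful: every question level is asked over the same 144 scenes, so a drop from scene understanding to spatial reasoning reflects the cognitive demand rather than a shift in the underlying stimuli.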