[2603.26731] Contextual inference from single objects in Vision-Language models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.26731 (cs)

[Submitted on 20 Mar 2026]

Title: Contextual inference from single objects in Vision-Language models

Authors: Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig

Abstract: How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both the fine-grained scene category and the coarse superordinate context (indoor vs. outdoor). We find that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate inference at the others, and the degree of coupling differs markedly across models. Mechanistically, object representations that remain stable when background context is removed are mor...
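The stimulus construction described in the abstract (a single object presented on a masked background) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: it assumes an RGB image as a NumPy array and a binary object segmentation mask, and replaces everything outside the mask with a uniform gray fill; the function and variable names are hypothetical.

```python
import numpy as np

def mask_background(image: np.ndarray, object_mask: np.ndarray,
                    fill: int = 127) -> np.ndarray:
    """Keep only the pixels inside the object mask.

    image: (H, W, 3) uint8 RGB array.
    object_mask: (H, W) boolean array, True where the object is.
    fill: gray value used to mask out the background.
    """
    out = np.full_like(image, fill)          # uniform gray background
    out[object_mask] = image[object_mask]    # paste the object back in
    return out

# Toy example: a 4x4 "image" with a 2x2 object in the top-left corner.
img = np.arange(48, dtype=np.uint8).reshape(4, 4, 3)
obj = np.zeros((4, 4), dtype=bool)
obj[:2, :2] = True
stim = mask_background(img, obj)  # object pixels kept, rest set to 127
```

The resulting stimulus would then be passed to a VLM (e.g. for zero-shot scene-category prediction) to test how much context the isolated object supports; that probing step depends on the specific model interface and is omitted here.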