[2603.26127] Finding Distributed Object-Centric Properties in Self-Supervised Transformers
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.26127 (cs)
[Submitted on 27 Mar 2026]

Title: Finding Distributed Object-Centric Properties in Self-Supervised Transformers
Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

Abstract: Self-supervised Vision Transformers (ViTs) such as DINO show an emergent ability to discover objects, typically observed in the [CLS] token attention maps of the final layer. However, these maps often contain spurious activations, resulting in poor object localization. This is because the [CLS] token, trained with an image-level objective, summarizes the entire image rather than focusing on objects. This aggregation dilutes the object-centric information present in the local, patch-level interactions. We analyze this by computing inter-patch similarity using the patch-level attention components (query, key, and value) across all layers. We find that: (1) object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token; and (2) this object-centric information is distributed across the network, not confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts...
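The analysis described in the abstract rests on comparing a seed patch to all other patches within a layer's attention components. A minimal sketch of that inter-patch similarity computation, assuming per-patch q/k/v features are already extracted from one layer (shapes, the cosine metric, and the seed-patch framing are illustrative assumptions, not the paper's exact procedure):

```python
# Illustrative sketch only: compute the inter-patch cosine-similarity map
# for one seed patch, using a single attention component (q, k, or v).
import numpy as np

def patch_similarity(feats: np.ndarray, seed: int) -> np.ndarray:
    """Cosine similarity between the seed patch and every patch.

    feats: (num_patches, dim) array of q, k, or v features from one layer.
    Returns a (num_patches,) similarity vector in [-1, 1].
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed[seed]

# Toy example with assumed shapes: 196 patches (a 14x14 grid), 64-dim features.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((196, 64)) for _ in range(3))

# One similarity map per component; the paper's observation is that all
# three carry object-centric structure, not just the key features.
maps = {name: patch_similarity(f, seed=0)
        for name, f in [("q", q), ("k", k), ("v", v)]}
```

Repeating this over every layer (rather than only the last) is what reveals the distributed nature of the object-centric information the paper reports.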