[2506.08915] Two-stage Vision Transformers and Hard Masking offer Robust Object Representations
Computer Science > Computer Vision and Pattern Recognition arXiv:2506.08915 (cs)
[Submitted on 10 Jun 2025 (v1), last revised 1 Apr 2026 (this version, v4)]
Title: Two-stage Vision Transformers and Hard Masking offer Robust Object Representations
Authors: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
Abstract: Context can strongly affect object representations, sometimes introducing undesired biases, particularly when objects appear against out-of-distribution backgrounds at inference time. At the same time, many object-centric tasks require leveraging context to identify the relevant image regions. We posit that this conundrum, in which context is simultaneously needed and a potential nuisance, can be addressed by an attention-based approach that uses learned binary attention masks to ensure that only attended image regions influence the prediction. To test this hypothesis, we evaluate a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, for which context cues are likely to be needed, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine st...
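The core mechanism described in the abstract can be sketched in a few lines: stage 1 scores each image patch for task relevance, a hard (binary) mask keeps only the high-scoring patches, and stage 2's receptive field is restricted to those patches. The function names (`hard_mask`, `stage2_masked_pool`), the threshold value, and the use of mean-pooling in place of the paper's masked transformer attention are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def hard_mask(scores, threshold=0.5):
    """Binarize stage-1 relevance scores into a hard attention mask.
    (In training, a straight-through-style estimator would typically be
    used so gradients can flow through the binarization; omitted here.)"""
    return (scores >= threshold).astype(float)

def stage2_masked_pool(patch_features, mask):
    """Stage-2 sketch: restrict the receptive field to the masked-in
    patches by pooling only over patches the hard mask keeps, so
    masked-out (potentially spurious) context cannot influence the
    resulting representation."""
    kept = mask > 0
    if not kept.any():
        # Degenerate case: no patch selected; fall back to all patches.
        return patch_features.mean(axis=0)
    return patch_features[kept].mean(axis=0)

# Toy example: 4 patches with 3-dim features.
feats = np.array([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.],
                  [9., 9., 9.]])
scores = np.array([0.9, 0.2, 0.1, 0.8])  # stage-1 relevance per patch
mask = hard_mask(scores)                  # keeps patches 0 and 3
rep = stage2_masked_pool(feats, mask)     # pools only the kept patches
```

Because the mask is binary rather than soft, excluded patches contribute exactly zero to the stage-2 output, which is what makes the representation robust to out-of-distribution backgrounds.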