[2506.08915] Two-stage Vision Transformers and Hard Masking offer Robust Object Representations

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.08915 (cs)

[Submitted on 10 Jun 2025 (v1), last revised 1 Apr 2026 (this version, v4)]

Title: Two-stage Vision Transformers and Hard Masking offer Robust Object Representations

Authors: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos

Abstract: Context can strongly affect object representations, sometimes leading to undesired biases, particularly when objects appear in out-of-distribution backgrounds at inference. At the same time, many object-centric tasks require leveraging the context to identify the relevant image regions. We posit that this conundrum, in which context is simultaneously needed and a potential nuisance, can be addressed by an attention-based approach that uses learned binary attention masks to ensure that only attended image regions influence the prediction. To test this hypothesis, we evaluate a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, for which context cues are likely to be needed, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine st...
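The abstract describes the mechanism only at a high level. As a concrete illustration, here is a minimal PyTorch sketch of how a two-stage model with hard input attention masking could be wired up. All module names, layer sizes, and the straight-through binarization are assumptions made for this sketch, not the authors' implementation (the truncated abstract does not specify these details).

```python
# Illustrative sketch only: a stage-1 encoder scores patches, a hard
# binary mask is applied, and a stage-2 encoder attends to kept patches.
import torch
import torch.nn as nn

class HardMask(nn.Module):
    """Binarize per-patch scores with a straight-through estimator so
    the hard (0/1) mask still passes gradients back to stage 1."""
    def forward(self, scores):                       # scores: (B, N)
        soft = torch.sigmoid(scores)
        hard = (soft > 0.5).float()
        return hard + soft - soft.detach()           # forward: hard; backward: soft

class TwoStageViT(nn.Module):
    def __init__(self, dim=192, depth=4, heads=3, n_classes=200):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.stage1 = nn.TransformerEncoder(layer, depth)  # full-image pass
        self.stage2 = nn.TransformerEncoder(layer, depth)  # masked pass
        self.score_head = nn.Linear(dim, 1)                # per-patch relevance score
        self.hard_mask = HardMask()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, patches):                      # patches: (B, N, D) embedded tokens
        b = patches.size(0)
        # Stage 1: process the full image (context allowed) and score
        # each patch for task relevance.
        feats = self.stage1(patches)
        mask = self.hard_mask(self.score_head(feats).squeeze(-1))  # (B, N) in {0, 1}
        # Stage 2: zero out discarded patches (keeps the straight-through
        # gradient path) and exclude them from attention via a key padding
        # mask, so unattended context cannot influence the prediction.
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, patches * mask.unsqueeze(-1)], dim=1)
        keep = torch.cat([torch.ones(b, 1, device=mask.device), mask], dim=1)
        out = self.stage2(tokens, src_key_padding_mask=(keep == 0))
        return self.classifier(out[:, 0]), mask

model = TwoStageViT()
logits, mask = model(torch.randn(2, 196, 192))       # 2 images, 14x14 patches of dim 192
print(logits.shape, mask.mean().item())              # torch.Size([2, 200]), kept fraction
```

As in the abstract, both stages are trained jointly; in this sketch the straight-through estimator is what lets the classification loss reach stage 1's scoring head despite the hard binary mask.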

Originally published on April 02, 2026. Curated by AI News.

