[2602.16689] Are Object-Centric Representations Better At Compositional Generalization?


Summary

This paper systematically tests whether object-centric representations improve compositional generalization in machine learning, using visual question answering tasks across controlled visual worlds.

Why It Matters

Understanding compositional generalization is crucial for advancing AI systems that can reason about novel combinations of familiar concepts. This research provides systematic evidence on how object-centric representations can improve performance in visually rich environments, which is vital for developing more capable AI models.

Key Takeaways

  • Object-centric representations outperform dense representations in challenging compositional generalization tasks.
  • Dense representations excel only in easier settings and require more computational resources.
  • Object-centric models are more sample efficient, achieving better generalization with fewer training images.

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.16689 (cs) [Submitted on 18 Feb 2026]

Title: Are Object-Centric Representations Better At Compositional Generalization?
Authors: Ferdinand Kapl, Amir Mohammad Karimi Mamaghan, Maximilian Seitzer, Karl Henrik Johansson, Carsten Marr, Stefan Bauer, Andrea Dittadi

Abstract: Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder com...
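To make the representational difference concrete: a dense encoder (e.g. a ViT like DINOv2) outputs one feature vector per image patch, while an object-centric model compresses the scene into a small set of per-object "slot" vectors that the downstream VQA head consumes. The sketch below is illustrative only, not the paper's code; the sizes (256 patches, 768 dimensions, 7 slots) and the random soft assignment standing in for a slot-attention module are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
N_PATCHES = 256   # e.g. a 16x16 grid of ViT patch tokens
D = 768           # feature dimension of the vision encoder
K_SLOTS = 7       # number of object slots in the OC model

# Dense representation: one feature vector per image patch.
dense_repr = rng.standard_normal((N_PATCHES, D))

# Object-centric representation: one feature vector per object slot.
# A slot-attention-style module would produce this by softly grouping
# patch features into objects; here a random soft assignment stands in.
assign = rng.random((K_SLOTS, N_PATCHES))
assign = assign / assign.sum(axis=1, keepdims=True)  # each row sums to 1
oc_repr = assign @ dense_repr                        # shape (K_SLOTS, D)

# The downstream VQA model sees far fewer tokens in the OC case.
print(dense_repr.shape)  # (256, 768)
print(oc_repr.shape)     # (7, 768)
```

The set-of-objects structure is what is hypothesized to help recombination: a question about an unseen property combination can bind to individual slots rather than to an entangled scene-level feature map.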

