[2211.12817] Learning to See the Elephant in the Room: Self-Supervised Context Reasoning in Humans and AI
Summary
This paper studies self-supervised context reasoning in humans and AI, introducing SeCo, a model that learns contextual relationships in complex scenes and outperforms existing self-supervised methods.
Why It Matters
Understanding how humans perceive contextual relationships without explicit supervision can enhance AI models' capabilities in scene understanding. This research bridges cognitive science and AI, potentially leading to more intuitive and efficient AI systems that mimic human reasoning.
Key Takeaways
- Humans learn contextual associations rapidly without explicit feedback.
- The SeCo model utilizes separate vision encoders and external memory for contextual reasoning.
- SeCo outperforms state-of-the-art self-supervised learning approaches.
- The study highlights the importance of contextual associations in scene understanding.
- Insights from human cognition can inform the development of advanced AI systems.
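The takeaway about separate vision encoders and an external memory can be sketched in miniature. This is an illustrative toy, not the paper's actual SeCo implementation: the "encoders" are random linear projections, and all names, dimensions, and the attention-based memory read are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy 'vision encoder': a random linear projection followed by
    L2 normalisation (a stand-in for a real network)."""
    h = x @ W
    return h / np.linalg.norm(h)

# Two separate encoders: one for the (hidden) target object and one for
# the surrounding context -- the separation mirrors the takeaway above.
D_IN, D_EMB, N_SLOTS = 16, 8, 5
W_target = rng.normal(size=(D_IN, D_EMB))
W_context = rng.normal(size=(D_IN, D_EMB))

# External memory: each slot pairs a context key with a target value.
memory_keys = np.zeros((N_SLOTS, D_EMB))
memory_values = np.zeros((N_SLOTS, D_EMB))

def write(slot, context_patch, target_patch):
    """Store one (context, target) association in a memory slot."""
    memory_keys[slot] = encode(context_patch, W_context)
    memory_values[slot] = encode(target_patch, W_target)

def read(context_patch):
    """Soft attention over memory slots keyed by the context embedding;
    returns the predicted target embedding and the attention weights."""
    q = encode(context_patch, W_context)
    scores = memory_keys @ q                       # cosine similarities
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over slots
    return attn @ memory_values, attn

# Store a few (context, target) pairs, then query with a seen context:
# the attention should concentrate on the matching slot.
contexts = rng.normal(size=(N_SLOTS, D_IN))
targets = rng.normal(size=(N_SLOTS, D_IN))
for i in range(N_SLOTS):
    write(i, contexts[i], targets[i])

pred, attn = read(contexts[0])
print(int(np.argmax(attn)))  # prints: 0
```

Querying the memory with a previously seen context retrieves the associated target embedding, which is the kind of context-to-object lookup the takeaway describes.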
Computer Science > Computer Vision and Pattern Recognition
arXiv:2211.12817 (cs)
[Submitted on 23 Nov 2022 (v1), last revised 23 Feb 2026 (this version, v3)]
Title: Learning to See the Elephant in the Room: Self-Supervised Context Reasoning in Humans and AI
Authors: Xiao Liu, Soumick Sarker, Ankur Sikarwar, Bryan Atista Kiely, Gabriel Kreiman, Zenglin Shi, Mengmi Zhang
Abstract: Humans rarely perceive objects in isolation but interpret scenes through relationships among co-occurring elements. How such contextual knowledge is acquired without explicit supervision remains unclear. Here we combine human psychophysics experiments with computational modelling to study the emergence of contextual reasoning. Participants were exposed to novel objects embedded in naturalistic scenes that followed predefined contextual rules capturing global context, local context and crowding. After viewing short training videos, participants completed a "lift-the-flap" task in which a hidden object had to be inferred from the surrounding context under variations in size, resolution and spatial arrangement. Humans rapidly learned these contextual associations without labels or feedback and generalised robustly across contextual changes. We then introduce SeCo (Self-supervised learning for Context Reasoning), a biologi...
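The "lift-the-flap" setup in the abstract (inferring a hidden object from its surrounding context, learned without labels or feedback) can be illustrated with a minimal sketch. The scenes, object symbols, and co-occurrence-counting inference rule below are all hypothetical simplifications, not the paper's task or model.

```python
from collections import Counter, defaultdict

# Toy scenes: each is a set of co-occurring 'objects' (symbols). The
# contextual rule here (illustrative only) is that bathroom items
# co-occur, as do kitchen items.
scenes = [
    {"sink", "mirror", "toothbrush"},
    {"sink", "towel", "toothbrush"},
    {"stove", "pan", "kettle"},
    {"stove", "pot", "kettle"},
]

# Self-supervised 'training': accumulate co-occurrence counts from
# passive exposure -- no labels and no feedback, as in the human study.
cooc = defaultdict(Counter)
for scene in scenes:
    for a in scene:
        for b in scene:
            if a != b:
                cooc[a][b] += 1

def lift_the_flap(context, candidates):
    """Infer the hidden object behind the 'flap' by summing
    co-occurrence evidence from the visible context."""
    scores = {c: sum(cooc[ctx][c] for ctx in context) for c in candidates}
    return max(scores, key=scores.get)

print(lift_the_flap({"sink", "mirror"}, ["toothbrush", "kettle"]))
# prints: toothbrush
```

Even this crude statistic supports the abstract's point: after unsupervised exposure, the surrounding context alone is enough to identify a hidden object.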