[2601.11675] Generating metamers of human scene understanding
Summary
This article presents MetamerGen, a novel tool that generates metamers of human scene understanding by combining low-resolution gist information with high-resolution details from visual fixations.
Why It Matters
Understanding how humans perceive and interpret visual scenes is central to progress in computer vision and artificial intelligence. MetamerGen offers insight into latent scene representations, supporting the development of AI systems that better mimic human visual processing.
Key Takeaways
- MetamerGen generates images based on human scene understanding using a dual-stream representation.
- The tool combines low-resolution peripheral information with high-resolution fixated details.
- A behavioral experiment validated the perceptual alignment of generated images with human scene representations.
- High-level semantic alignment is crucial for predicting metamerism in generated scenes.
- The research contributes to understanding visual processing at multiple levels.
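The dual-stream idea — full-detail features at fixated locations, degraded "gist" features in the periphery — can be sketched in a few lines. The function below is a hypothetical illustration only (the names `fuse_foveated_tokens`, the grid size, and the use of global average pooling as the "degraded" stream are all assumptions, not the paper's actual DINOv2 pipeline):

```python
import numpy as np

def fuse_foveated_tokens(tokens, fixations, grid_size=16, radius=2):
    """Illustrative dual-stream fusion (hypothetical, not the paper's code):
    keep full-resolution tokens near fixated grid cells and replace the
    periphery with a degraded (here: globally averaged) version."""
    h = w = grid_size
    grid = tokens.reshape(h, w, -1)

    # Peripheral stream: degrade features by pooling over the whole grid,
    # a crude stand-in for low-resolution "gist" information.
    gist = grid.mean(axis=(0, 1), keepdims=True)
    peripheral = np.broadcast_to(gist, grid.shape)

    # Foveal mask: cells within `radius` of any fixation keep full detail.
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=bool)
    for fy, fx in fixations:
        mask |= (ys - fy) ** 2 + (xs - fx) ** 2 <= radius ** 2

    # Fuse the two streams: detailed where fixated, gist elsewhere.
    fused = np.where(mask[..., None], grid, peripheral)
    return fused.reshape(h * w, -1)

# Example: a 16x16 grid of 8-dim tokens with one central fixation.
tokens = np.random.rand(256, 8)
fused = fuse_foveated_tokens(tokens, fixations=[(8, 8)])
```

In the actual model, the fused token sequence would condition a latent diffusion decoder; here the point is only that fixated cells pass through unchanged while peripheral cells carry coarse context.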
Computer Science > Computer Vision and Pattern Recognition
arXiv:2601.11675 (cs)
[Submitted on 16 Jan 2026 (v1), last revised 24 Feb 2026 (this version, v3)]
Title: Generating metamers of human scene understanding
Authors: Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras, Gregory J. Zelinsky
Abstract: Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high- and low-resolution (i.e., "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen-generated images to latent human scene representations, we con...