[2602.12486] Human-Like Coarse Object Representations in Vision Models
Summary
This paper asks whether and when segmentation models acquire the coarse, volumetric object representations that humans appear to use for intuitive physics, which trade fine visual detail for efficient physical prediction.
Why It Matters
Segmentation models optimize pixel-accurate masks, which may not match the coarse object bodies humans appear to use when predicting physical events. The results suggest that training time, model size, and pruning level determine how closely a model's object representations align with human behavior, which is relevant to autonomous systems and other computer vision applications that depend on physical prediction.
Key Takeaways
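- Human-like coarse object representations appear to emerge from resource constraints rather than bespoke biases.
- Alignment with human behavior follows an inverse U-shaped curve across training time, model size, and pruning: small, briefly trained, or pruned models under-segment into blobs, large or fully trained models over-segment with boundary wiggles, and an intermediate granularity best matches humans.
- Early checkpoints, modest architectures, and light pruning are simple knobs for eliciting physics-efficient representations.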
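The inverse U-shaped pattern in the takeaways can be illustrated with a minimal sketch of a sweep over checkpoints. Everything here is an assumption for illustration only: the function names `alignment` and `best_setting`, the use of negative mean absolute time-to-collision (TTC) error as the score, and the toy numbers are hypothetical, not the paper's alignment metric or data.

```python
from statistics import mean


def alignment(model_ttc, human_ttc):
    """Hypothetical alignment score: negative mean absolute TTC error.

    Higher is better. This is a stand-in, not the paper's metric.
    """
    if len(model_ttc) != len(human_ttc):
        raise ValueError("model and human TTC lists must have the same length")
    return -mean(abs(m - h) for m, h in zip(model_ttc, human_ttc))


def best_setting(predictions_by_setting, human_ttc):
    """Score every setting in a sweep; return (best setting name, all scores)."""
    scores = {
        name: alignment(preds, human_ttc)
        for name, preds in predictions_by_setting.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, made-up values (seconds to collision per trial), for illustration only.
human_ttc = [1.0, 1.5, 2.0, 2.5]
predictions_by_checkpoint = {
    "early": [0.6, 1.0, 1.4, 1.9],
    "middle": [1.0, 1.4, 2.1, 2.5],
    "late": [1.3, 1.9, 2.4, 3.0],
}

best, scores = best_setting(predictions_by_checkpoint, human_ttc)
print(best, scores)
```

With these toy numbers the middle checkpoint scores highest and both extremes score lower, which is the shape an inverse U-shaped alignment curve would produce. The same sweep could be keyed by model size or pruning level instead of checkpoint.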
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.12486 (cs)
[Submitted on 12 Feb 2026]
Title: Human-Like Coarse Object Representations in Vision Models
Authors: Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman
Abstract: Humans appear to represent objects for intuitive physics with coarse, volumetric "bodies" that smooth concavities - trading fine visual details for efficient physical predictions - yet their internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; and an intermediate "ideal body granularity" best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs - early checkpoints, modest architectures, light pruning - for eliciting physics-ef...