Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.13904 (cs)
[Submitted on 14 Mar 2026 (v1), last revised 25 Mar 2026 (this version, v2)]

Title: Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
Authors: Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo

Abstract: For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode the what-is-where composition of the scene.
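To make the objective concrete, below is a minimal PyTorch sketch of a global-to-local reconstruction loss of this kind. Everything beyond what the abstract states is an assumption: the class name GlobalToLocalReconstructor, the MLP bottleneck, the 90% mask ratio, the two-layer cross-attention decoder, and all dimensions are illustrative placeholders, not CroBo's actual architecture.

```python
import torch
import torch.nn as nn


class GlobalToLocalReconstructor(nn.Module):
    """Illustrative sketch of a global-to-local reconstruction objective.

    Assumed setup: both frames are already split into ViT-style patch
    features; the mask ratio and module sizes are placeholder choices.
    """

    def __init__(self, patch_dim=256, num_patches=196, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Compress all reference patches into a single bottleneck token.
        self.bottleneck = nn.Sequential(
            nn.Linear(patch_dim * num_patches, patch_dim),
            nn.GELU(),
            nn.Linear(patch_dim, patch_dim),
        )
        # One learnable mask-token query per crop position, plus positional
        # embeddings so each query knows "where" it sits in the crop.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, patch_dim))
        # Light decoder: queries cross-attend to [bottleneck; visible patches].
        layer = nn.TransformerDecoderLayer(
            d_model=patch_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, ref_patches, crop_patches):
        # ref_patches:  (B, num_patches, patch_dim), full reference frame
        # crop_patches: (B, num_patches, patch_dim), local target crop
        B, N, D = crop_patches.shape
        # Global context: the whole reference observation in one token.
        global_tok = self.bottleneck(ref_patches.flatten(1)).unsqueeze(1)
        # Heavily mask the crop; keep only a sparse set of visible patches.
        num_keep = max(1, int(N * (1.0 - self.mask_ratio)))
        perm = torch.rand(B, N, device=crop_patches.device).argsort(dim=1)
        keep = perm[:, :num_keep]
        visible = torch.gather(
            crop_patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        # Decode a query for every crop position against the bottleneck
        # token and the sparse visible cues.
        queries = (self.mask_token + self.pos_embed).expand(B, -1, -1)
        memory = torch.cat([global_tok, visible], dim=1)
        pred = self.decoder(queries, memory)
        # Reconstruction loss over the masked positions only.
        masked = torch.ones(B, N, dtype=torch.bool, device=pred.device)
        masked.scatter_(1, keep, False)
        return ((pred - crop_patches) ** 2)[masked].mean()


# Tiny smoke test with random stand-ins for patch features.
model = GlobalToLocalReconstructor()
ref = torch.randn(2, 196, 256)
crop = torch.randn(2, 196, 256)
loss = model(ref, crop)
loss.backward()
```

The point the abstract emphasizes is visible in the loss: because only a sparse set of crop patches remains visible, accurately reconstructing the rest forces the single bottleneck token to carry both what is in the reference scene and where it is.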