Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.13904 (cs)
[Submitted on 14 Mar 2026 (v1), last revised 25 Mar 2026 (this version, v2)]

Title: Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
Authors: Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo

Abstract: For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode the what-is-where composition of the scene.
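To make the objective concrete, below is a minimal PyTorch sketch of a global-to-local reconstruction loss of this kind. Everything beyond what the abstract states is an assumption: the class name GlobalToLocalReconstructor, the MLP bottleneck, the 90% mask ratio, the two-layer cross-attention decoder, and all dimensions are illustrative placeholders, not CroBo's actual architecture.

```python
import torch
import torch.nn as nn


class GlobalToLocalReconstructor(nn.Module):
    """Illustrative sketch of a global-to-local reconstruction objective.

    Assumed setup: both frames are already split into ViT-style patch
    features; the mask ratio and module sizes are placeholder choices.
    """

    def __init__(self, patch_dim=256, num_patches=196, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Compress all reference patches into a single bottleneck token.
        self.bottleneck = nn.Sequential(
            nn.Linear(patch_dim * num_patches, patch_dim),
            nn.GELU(),
            nn.Linear(patch_dim, patch_dim),
        )
        # One learnable mask-token query per crop position, plus positional
        # embeddings so each query knows "where" it sits in the crop.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, patch_dim))
        # Light decoder: queries cross-attend to [bottleneck; visible patches].
        layer = nn.TransformerDecoderLayer(
            d_model=patch_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, ref_patches, crop_patches):
        # ref_patches:  (B, num_patches, patch_dim), full reference frame
        # crop_patches: (B, num_patches, patch_dim), local target crop
        B, N, D = crop_patches.shape
        # Global context: the whole reference observation in one token.
        global_tok = self.bottleneck(ref_patches.flatten(1)).unsqueeze(1)
        # Heavily mask the crop; keep only a sparse set of visible patches.
        num_keep = max(1, int(N * (1.0 - self.mask_ratio)))
        perm = torch.rand(B, N, device=crop_patches.device).argsort(dim=1)
        keep = perm[:, :num_keep]
        visible = torch.gather(
            crop_patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        # Decode a query for every crop position against the bottleneck
        # token and the sparse visible cues.
        queries = (self.mask_token + self.pos_embed).expand(B, -1, -1)
        memory = torch.cat([global_tok, visible], dim=1)
        pred = self.decoder(queries, memory)
        # Reconstruction loss over the masked positions only.
        masked = torch.ones(B, N, dtype=torch.bool, device=pred.device)
        masked.scatter_(1, keep, False)
        return ((pred - crop_patches) ** 2)[masked].mean()


# Tiny smoke test with random stand-ins for patch features.
model = GlobalToLocalReconstructor()
ref = torch.randn(2, 196, 256)
crop = torch.randn(2, 196, 256)
loss = model(ref, crop)
loss.backward()
```

The point the abstract emphasizes is visible in the loss: because only a sparse set of crop patches remains visible, accurately reconstructing the rest forces the single bottleneck token to carry both what is in the reference scene and where it is.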