[2603.28069] MolmoPoint: Better Pointing for VLMs with Grounding Tokens
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.28069 (cs)
[Submitted on 30 Mar 2026]

Title: MolmoPoint: Better Pointing for VLMs with Grounding Tokens
Authors: Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

Abstract: Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make pointing more fine-grained, we follow this pointing token with an additional special token that selects a subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting...
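The three-stage selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hidden sizes, grid sizes, dot-product scoring, and the `subpatch_embed`/`location_embed` class embeddings are all assumptions made for the sketch, and random values stand in for real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32      # hidden size (assumed)
GRID = 24   # image as a 24x24 grid of visual tokens (assumed)
SUB = 4     # each level refines into a 4x4 sub-grid (assumed)

# Hypothetical model outputs: hidden states of the three special tokens,
# plus learned class embeddings for the two refinement stages.
h_point, h_sub, h_loc = rng.standard_normal((3, D))
visual_tokens = rng.standard_normal((GRID * GRID, D))  # image patch features
subpatch_embed = rng.standard_normal((SUB * SUB, D))   # subpatch classes
location_embed = rng.standard_normal((SUB * SUB, D))   # within-subpatch classes

def pick(query, candidates):
    """Dot-product cross-attention scores followed by argmax selection."""
    return int(np.argmax(candidates @ query))

# Stage 1: the pointing token selects one visual (patch) token.
patch = pick(h_point, visual_tokens)
py, px = divmod(patch, GRID)

# Stage 2: a second special token selects a subpatch inside that patch.
sy, sx = divmod(pick(h_sub, subpatch_embed), SUB)

# Stage 3: a third token specifies a location within the subpatch.
ly, lx = divmod(pick(h_loc, location_embed), SUB)

# Compose the three levels into normalized (x, y) image coordinates.
x = (px + (sx + (lx + 0.5) / SUB) / SUB) / GRID
y = (py + (sy + (ly + 0.5) / SUB) / SUB) / GRID
print(f"point at ({x:.4f}, {y:.4f})")
```

Each stage is a classification over a small, fixed candidate set, which is what lets the model avoid emitting raw coordinates as text while still resolving points finer than a single patch.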