[2510.18876] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Computer Science > Computer Vision and Pattern Recognition

arXiv:2510.18876 (cs)

[Submitted on 21 Oct 2025 (v1), last revised 5 Mar 2026 (this version, v3)]

Title: Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Authors: Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang

Abstract: While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle to capture the dense world of complex scenes, which requires fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, these capabilities naturally enable (3) advanced compositional reasoning to answer specific free-form questions about any region,...
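The abstract attributes GAR's precise region perception to an RoI-aligned feature replay technique that crops region features while retaining global context. The paper's actual architecture is not detailed on this page, so the following is only a minimal PyTorch sketch of the general idea: region features are RoI-aligned out of the encoder's global feature map and flattened into extra visual tokens. The function name `replay_region_features`, the 7x7 pooling size, and the spatial scale are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of RoI-aligned feature "replay", assuming a ViT-style
# vision encoder and torchvision's roi_align. The paper's actual design may
# differ in pooling size, fusion, and token layout.
import torch
from torchvision.ops import roi_align

def replay_region_features(feature_map, boxes, output_size=7, spatial_scale=1/14):
    """Crop RoI-aligned features for each prompted region and flatten them
    into extra visual tokens, to be consumed alongside the global tokens.

    feature_map:   (B, C, H, W) dense features from the vision encoder.
    boxes:         (N, 5) rows of (batch_index, x1, y1, x2, y2) in pixels.
    spatial_scale: maps pixel coords to feature-map coords (e.g. 1/14 for a
                   ViT with 14x14-pixel patches on the input resolution).
    """
    # Bilinearly sample an (N, C, output_size, output_size) crop per region.
    region_feats = roi_align(
        feature_map, boxes,
        output_size=(output_size, output_size),
        spatial_scale=spatial_scale,
        aligned=True,
    )
    # Flatten each crop into output_size**2 region tokens of dimension C.
    return region_feats.flatten(2).transpose(1, 2)  # (N, S*S, C)

# Example: one region prompt over a single image's 16x16 patch grid.
feats = torch.randn(1, 1024, 16, 16)                    # encoder output
boxes = torch.tensor([[0, 32.0, 48.0, 160.0, 200.0]])   # (idx, x1, y1, x2, y2)
tokens = replay_region_features(feats, boxes)           # -> (1, 49, 1024)
```

Because the region tokens are sampled from the same feature map the global tokens come from, the model can relate each prompted region back to its surrounding scene context, which is what distinguishes this setup from cropping and re-encoding regions in isolation.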