[2602.14201] GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
Summary
GeoEyes is a staged training framework that teaches multimodal large language models (MLLMs) to focus visually on demand in ultra-high-resolution remote sensing imagery, addressing a failure mode of existing zoom-enabled models in which tool calls collapse into task-agnostic patterns.
Why It Matters
The ability to effectively analyze ultra-high-resolution remote sensing imagery is crucial for various applications, including environmental monitoring and urban planning. GeoEyes tackles the challenge of evidence acquisition in visual question answering, improving model performance and usability in critical real-world scenarios.
Key Takeaways
- GeoEyes employs a staged training framework to enhance visual focusing.
- Introduces UHR Chain-of-Zoom (UHR-CoZ) dataset for diverse zooming regimes.
- Utilizes reinforcement learning to reward evidence gain and answer improvement.
- Achieves 54.23% accuracy on XLRS-Bench, demonstrating significant performance improvements.
- Addresses the issue of Tool Usage Homogenization in multimodal large language models.
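The reinforcement-learning takeaway above can be illustrated with a toy reward function. This is a hypothetical sketch in the spirit of AdaZoom-GRPO, which per the paper rewards evidence gain and answer improvement during zoom interactions; the function name, scoring inputs, and weights here are illustrative assumptions, not the authors' actual formulation.

```python
def zoom_step_reward(
    evidence_before: float,      # assumed relevance score of the current view, in [0, 1]
    evidence_after: float,       # relevance score after the zoom-in tool call
    answer_correct_before: bool,
    answer_correct_after: bool,
    w_evidence: float = 0.5,     # illustrative weights, not from the paper
    w_answer: float = 0.5,
) -> float:
    """Combine evidence gain and answer improvement into one scalar reward."""
    evidence_gain = evidence_after - evidence_before                    # in [-1, 1]
    answer_improvement = float(answer_correct_after) - float(answer_correct_before)
    return w_evidence * evidence_gain + w_answer * answer_improvement
```

Under this sketch, a zoom that both surfaces new evidence and flips the answer from wrong to right earns the highest reward, while an uninformative zoom earns nothing, which is one way to discourage homogenized, task-agnostic tool calls.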
arXiv:2602.14201 (cs) — Computer Science > Computer Vision and Pattern Recognition
Submitted on 15 Feb 2026
Title: GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du
Abstract: The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substanti...