[2602.14201] GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
Summary
GeoEyes is a staged training framework that teaches multimodal large language models (MLLMs) to focus visually on demand in ultra-high-resolution remote sensing imagery, addressing a failure mode of existing zoom-enabled models in which tool calls collapse into task-agnostic patterns.
Why It Matters
The ability to effectively analyze ultra-high-resolution remote sensing imagery is crucial for various applications, including environmental monitoring and urban planning. GeoEyes tackles the challenge of evidence acquisition in visual question answering, improving model performance and usability in critical real-world scenarios.
Key Takeaways
- GeoEyes employs a staged training framework to enhance visual focusing.
- Introduces UHR Chain-of-Zoom (UHR-CoZ) dataset for diverse zooming regimes.
- Utilizes reinforcement learning to reward evidence gain and answer improvement.
- Achieves 54.23% accuracy on XLRS-Bench, demonstrating significant performance improvements.
- Addresses the issue of Tool Usage Homogenization in multimodal large language models.
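The reinforcement-learning takeaway above can be illustrated with a toy reward function. This is a hypothetical sketch in the spirit of AdaZoom-GRPO, which per the paper rewards evidence gain and answer improvement during zoom interactions; the function name, scoring inputs, and weights here are illustrative assumptions, not the authors' actual formulation.

```python
def zoom_step_reward(
    evidence_before: float,      # assumed relevance score of the current view, in [0, 1]
    evidence_after: float,       # relevance score after the zoom-in tool call
    answer_correct_before: bool,
    answer_correct_after: bool,
    w_evidence: float = 0.5,     # illustrative weights, not from the paper
    w_answer: float = 0.5,
) -> float:
    """Combine evidence gain and answer improvement into one scalar reward."""
    evidence_gain = evidence_after - evidence_before                    # in [-1, 1]
    answer_improvement = float(answer_correct_after) - float(answer_correct_before)
    return w_evidence * evidence_gain + w_answer * answer_improvement
```

Under this sketch, a zoom that both surfaces new evidence and flips the answer from wrong to right earns the highest reward, while an uninformative zoom earns nothing, which is one way to discourage homogenized, task-agnostic tool calls.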
arXiv:2602.14201 (cs) — Computer Science > Computer Vision and Pattern Recognition
Submitted on 15 Feb 2026
Title: GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du
Abstract: The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substanti...