[2602.14201] GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery


Summary

GeoEyes introduces a staged training framework for evidence-grounded understanding of ultra-high-resolution remote sensing imagery, addressing a failure mode in which existing multimodal models' zoom-tool calls collapse into task-agnostic patterns.

Why It Matters

The ability to analyze ultra-high-resolution remote sensing imagery effectively is crucial for applications such as environmental monitoring and urban planning. GeoEyes tackles the challenge of acquiring task-relevant visual evidence during visual question answering, where cues are sparse and tiny, improving model performance and reliability in these real-world scenarios.

Key Takeaways

  • GeoEyes employs a staged training framework to enhance visual focusing.
  • Introduces the UHR Chain-of-Zoom (UHR-CoZ) cold-start SFT dataset, covering diverse zooming regimes.
  • Uses agentic reinforcement learning (AdaZoom-GRPO) to reward evidence gain and answer improvement during zoom interactions, as sketched after this list.
  • Achieves 54.23% accuracy on XLRS-Bench, demonstrating significant performance improvements.
  • Addresses the issue of Tool Usage Homogenization in multimodal large language models.
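
To make the reward idea in the reinforcement-learning takeaway concrete, here is a minimal, hypothetical sketch in the spirit of AdaZoom-GRPO: it credits a zoom step for the evidence it adds and for flipping the answer from wrong to right. The signal names, weights, and the function itself are illustrative assumptions, not the paper's implementation.

# Hypothetical per-zoom reward combining evidence gain and answer improvement,
# in the spirit of AdaZoom-GRPO as summarized above. All names and weights
# are illustrative assumptions, not the paper's code.

def zoom_step_reward(
    relevance_before: float,   # estimated task relevance of visible evidence before the zoom (0..1)
    relevance_after: float,    # estimated task relevance after the zoom-in tool call (0..1)
    correct_before: bool,      # would the model answer correctly without this zoom?
    correct_after: bool,       # does it answer correctly after seeing the crop?
    w_evidence: float = 0.5,
    w_answer: float = 1.0,
) -> float:
    """Reward one zoom interaction for the evidence it adds and the answer
    improvement it enables (hedged illustration only)."""
    evidence_gain = max(0.0, relevance_after - relevance_before)
    answer_gain = float(correct_after) - float(correct_before)
    return w_evidence * evidence_gain + w_answer * answer_gain


# Example: a zoom that reveals the decisive cue and fixes the answer.
print(zoom_step_reward(0.2, 0.8, correct_before=False, correct_after=True))  # 0.5*0.6 + 1.0*1.0 = 1.3

Under a scheme like this, a zoom onto an irrelevant region earns little or nothing, which is exactly the behavior a reward against Tool Usage Homogenization would need to discourage.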

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.14201 (cs) [Submitted on 15 Feb 2026]

Title: GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du

Abstract: The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial performance improvements, reaching 54.23% accuracy on XLRS-Bench.
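
For a concrete picture of the "thinking-with-images" interaction the abstract describes, here is a minimal, hypothetical zoom loop for UHR remote sensing VQA. The model interface (propose_action, answer) and the stopping convention (returning no box) are assumptions made for illustration; they are not GeoEyes' actual API.

# Hypothetical on-demand zoom loop illustrating the interaction pattern only.
from dataclasses import dataclass
from typing import Optional, Tuple

from PIL import Image  # pip install pillow


@dataclass
class ZoomAction:
    # (left, top, right, bottom) in pixels; None means the model has enough evidence
    box: Optional[Tuple[int, int, int, int]]


def answer_with_zoom(model, uhr_image: Image.Image, question: str, max_zooms: int = 4) -> str:
    """Iteratively crop the regions the model asks for, then answer.
    `model` is a hypothetical MLLM wrapper with propose_action/answer methods."""
    views = [uhr_image]  # in practice the first view would be a downsampled global view
    for _ in range(max_zooms):
        action = model.propose_action(views, question)   # hypothetical API
        if action.box is None:                           # on-demand stopping behavior
            break
        views.append(uhr_image.crop(action.box))         # zoom in on the requested region
    return model.answer(views, question)                 # hypothetical API

The point of the loop is that each crop is chosen by the model in response to the question rather than following a fixed, task-agnostic zoom pattern, which is the Tool Usage Homogenization failure the paper targets.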
