[2602.14225] Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding
Summary
This paper shows that staged knowledge injection substantially improves agentic reinforcement learning for ultra-high-resolution remote sensing tasks, demonstrating that text-first training strengthens downstream visual reasoning.
Why It Matters
The findings highlight a novel approach to overcoming challenges in multimodal reasoning for remote sensing, suggesting that high-quality textual data can effectively guide visual learning. This has implications for improving AI models in environmental monitoring and other applications reliant on remote sensing technologies.
Key Takeaways
- Staged knowledge injection significantly enhances visual reasoning in remote sensing.
- Text-based training can outperform traditional image-based methods in certain scenarios.
- The proposed approach achieves new state-of-the-art performance on ultra-high-resolution remote sensing tasks.
Computer Science > Artificial Intelligence
arXiv:2602.14225 (cs)
[Submitted on 15 Feb 2026]
Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yuhao Zhou, Di Wang, Yifan Zhang, Haoyu Wang, Haiyan Zhao, Hongda Sun, Long Lan, Jun Song, Yulin Wang, Jing Zhang, Wenlong Zhang, Bo Du
Abstract: Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model must localize tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms, comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on the UHR RS setting (this http URL). Our controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concep...
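The abstract's "zoom-in tools" refer to the agent requesting full-resolution crops of task-relevant regions instead of reasoning over a downsampled whole image. A minimal sketch of such a tool is below; the function name and interface are illustrative assumptions, not the paper's actual API.

```python
# Illustrative sketch only: a minimal "zoom-in" tool of the kind an agentic
# RLVR loop could call to extract full-resolution evidence from a UHR image.
# The name `zoom_in` and its signature are assumptions for this example.

def zoom_in(image, x, y, w, h):
    """Crop a full-resolution region from a pixel grid.

    image: list of rows (each row a list of pixel values)
    (x, y): top-left corner of the requested region
    (w, h): requested region width and height
    """
    H, W = len(image), len(image[0])
    # Clamp the request so the crop stays inside the image bounds.
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(W, x + w), min(H, y + h)
    return [row[x0:x1] for row in image[y0:y1]]

# Toy 4x4 "image"; an agent would iteratively request crops like this
# rather than downsampling the entire ultra-high-resolution scene.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = zoom_in(img, 1, 1, 2, 2)
```

In an agentic RLVR setup, each such crop would be fed back to the model as a new observation, and a verifiable reward (e.g., answer correctness) would score the full tool-use trajectory.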