[2511.22396] Asking like Socrates: Socrates helps VLMs understand remote sensing images
Computer Science > Computer Vision and Pattern Recognition

arXiv:2511.22396 (cs) [Submitted on 27 Nov 2025 (v1), last revised 8 Apr 2026 (this version, v2)]

Title: Asking like Socrates: Socrates helps VLMs understand remote sensing images

Authors: Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li

Abstract: Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: f...