[2604.09025] Skill-Conditioned Visual Geolocation for Vision-Language
Computer Science > Computer Vision and Pattern Recognition
arXiv:2604.09025 (cs)
[Submitted on 10 Apr 2026]

Title: Skill-Conditioned Visual Geolocation for Vision-Language
Authors: Chenjie Yang, Yutian Jiang, Chenyu Wu

Abstract: Vision-language models (VLMs) have shown promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods rely predominantly on implicit parametric memory, which often draws on outdated knowledge and produces hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops needed for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework built on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model that performs direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes a...
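The abstract only outlines the pipeline, so the following is a minimal sketch of how a skill-graph-guided inference and evolution loop of this kind might be organized. All names (SkillGraph, infer_location, evolve), the rollout count, the distance tolerance, and the model interface are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a GeoSkill-style loop: a graph of natural-language
# skills conditions the VLM prompt, and rollouts on image-coordinate pairs
# are split into successes and failures to drive skill synthesis.

from dataclasses import dataclass, field
from math import radians, sin, cos, asin, sqrt


@dataclass
class Skill:
    """An atomic, natural-language geolocation skill (e.g. road-sign cues)."""
    name: str
    description: str


@dataclass
class SkillGraph:
    """Holds the skills; a flat list stands in for the evolving graph here."""
    skills: list[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        if all(s.name != skill.name for s in self.skills):
            self.skills.append(skill)

    def as_prompt(self) -> str:
        return "\n".join(f"- {s.name}: {s.description}" for s in self.skills)


def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in km between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def infer_location(image: str, graph: SkillGraph, model) -> tuple[float, float]:
    """Skill-conditioned inference: prompt the model with the current skills.

    `model` is an assumed callable taking (image, prompt) and returning (lat, lon).
    """
    prompt = f"Skills:\n{graph.as_prompt()}\nLocate the image; return lat, lon."
    return model(image, prompt)


def evolve(graph: SkillGraph, pairs, big_model, rollouts: int = 4, tol_km: float = 25.0):
    """Autonomous evolution: roll out reasoning on image-coordinate pairs,
    split trajectories by success, and synthesize new skills from the contrast."""
    for image, truth in pairs:
        successes, failures = [], []
        for _ in range(rollouts):
            pred = infer_location(image, graph, big_model)
            (successes if haversine_km(pred, truth) <= tol_km else failures).append(pred)
        # In a real system the larger model would summarize why successful
        # trajectories worked and failed ones did not, then distill that
        # contrast into a new natural-language skill for the graph.
        if successes:
            graph.add(Skill(f"cue-from-{image}", "distilled from contrasting rollouts"))
    return graph
```

The success/failure split is the load-bearing choice here: skills are synthesized from the contrast between rollouts on the same image, rather than from successes alone, which matches the abstract's claim that both successful and failed trajectories are analyzed.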