[2602.15724] Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
Summary
This paper presents a retrieval-augmented framework that makes LLM-based Vision-and-Language Navigation (VLN) more efficient and stable, supplying the language model with retrieved context so it can make better navigation decisions without being modified or fine-tuned.
Why It Matters
As VLN systems become increasingly reliant on LLMs, improving their efficiency and stability is crucial for real-world deployment. This research targets concrete weaknesses of prompt-based LLM navigation, namely re-interpreting instructions from scratch at every step and reasoning over noisy, verbose lists of navigable candidates, making it relevant for researchers and practitioners in AI and robotics.
Key Takeaways
- Introduces a retrieval-augmented framework for VLN.
- Improves decision-making efficiency without modifying LLMs.
- Demonstrates significant performance gains on the Room-to-Room benchmark.
- Utilizes lightweight, modular retrieval modules for instruction grounding.
- Highlights the complementary benefits of exemplar retrieval and candidate pruning.
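The episode-level half of the framework, retrieving semantically similar successful trajectories as in-context exemplars, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function is a toy deterministic stand-in for a real sentence encoder, and the `memory` entries are hypothetical.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic text embedding (stand-in for a real sentence encoder)."""
    seed = int(hashlib.sha256(text.lower().encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve_exemplars(instruction: str, memory: list[dict], k: int = 2) -> list[dict]:
    """Return the k stored episodes whose instructions are most similar to the
    query (cosine similarity; embeddings are unit-norm, so a dot product suffices)."""
    q = embed(instruction)
    return sorted(memory, key=lambda ep: -float(q @ embed(ep["instruction"])))[:k]

# Hypothetical memory of successful navigation trajectories.
memory = [
    {"instruction": "walk down the hallway and stop at the red door",
     "trajectory": ["n1", "n4", "n7"]},
    {"instruction": "go upstairs and enter the bathroom",
     "trajectory": ["n2", "n5"]},
    {"instruction": "walk down the corridor and stop at the blue door",
     "trajectory": ["n1", "n3", "n8"]},
]

exemplars = retrieve_exemplars("walk down the hallway and stop at the red door",
                               memory, k=2)
```

The retrieved exemplars would then be placed in the LLM's prompt as task-specific priors for grounding the new instruction.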
Paper Details
Authors: Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao
Submitted: 17 Feb 2026 (arXiv:2602.15724, Computer Science > Computer Vision and Pattern Recognition)
Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigabl...
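The step-level half, pruning irrelevant navigable candidates before they reach the LLM, can be illustrated with a simple sketch. Note the assumptions: the paper's retriever is learned by imitation, whereas here a word-overlap heuristic stands in for the learned scorer, and the candidate descriptions are hypothetical.

```python
def prune_candidates(instruction: str, candidates: list[dict], keep: int = 3) -> list[dict]:
    """Keep only the `keep` candidates whose descriptions best match the
    instruction. A word-overlap count stands in for the paper's
    imitation-learned scorer."""
    instr_words = set(instruction.lower().split())
    def score(c: dict) -> int:
        return len(instr_words & set(c["description"].lower().split()))
    return sorted(candidates, key=score, reverse=True)[:keep]

# Hypothetical navigable candidates observed at one step.
candidates = [
    {"id": "v1", "description": "a doorway leading into the kitchen"},
    {"id": "v2", "description": "a window above the couch"},
    {"id": "v3", "description": "stairs going down to the basement"},
    {"id": "v4", "description": "the hallway toward the kitchen door"},
]

pruned = prune_candidates("walk to the kitchen and stop at the door",
                          candidates, keep=2)
```

Only the pruned, shortened candidate list is then passed to the LLM at each step, which is where the efficiency gain comes from: the model reasons over fewer, more relevant options.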