[2602.15724] Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
Summary
This paper presents a retrieval-augmented framework that makes LLM-based Vision-and-Language Navigation (VLN) more efficient and stable, supplying the language model with retrieved context so it can make better navigation decisions without being modified or fine-tuned.
Why It Matters
As VLN systems become increasingly reliant on LLMs, improving their efficiency and stability is crucial for real-world deployment. This research targets concrete weaknesses of prompt-based LLM navigation, namely re-interpreting instructions from scratch at every step and reasoning over noisy, verbose lists of navigable candidates, making it relevant for researchers and practitioners in AI and robotics.
Key Takeaways
- Introduces a retrieval-augmented framework for VLN.
- Improves decision-making efficiency without modifying LLMs.
- Demonstrates significant performance gains on the Room-to-Room benchmark.
- Utilizes lightweight, modular retrieval modules for instruction grounding.
- Highlights the complementary benefits of exemplar retrieval and candidate pruning.
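The episode-level half of the framework, retrieving semantically similar successful trajectories as in-context exemplars, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function is a toy deterministic stand-in for a real sentence encoder, and the `memory` entries are hypothetical.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic text embedding (stand-in for a real sentence encoder)."""
    seed = int(hashlib.sha256(text.lower().encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve_exemplars(instruction: str, memory: list[dict], k: int = 2) -> list[dict]:
    """Return the k stored episodes whose instructions are most similar to the
    query (cosine similarity; embeddings are unit-norm, so a dot product suffices)."""
    q = embed(instruction)
    return sorted(memory, key=lambda ep: -float(q @ embed(ep["instruction"])))[:k]

# Hypothetical memory of successful navigation trajectories.
memory = [
    {"instruction": "walk down the hallway and stop at the red door",
     "trajectory": ["n1", "n4", "n7"]},
    {"instruction": "go upstairs and enter the bathroom",
     "trajectory": ["n2", "n5"]},
    {"instruction": "walk down the corridor and stop at the blue door",
     "trajectory": ["n1", "n3", "n8"]},
]

exemplars = retrieve_exemplars("walk down the hallway and stop at the red door",
                               memory, k=2)
```

The retrieved exemplars would then be placed in the LLM's prompt as task-specific priors for grounding the new instruction.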
Paper Details
Authors: Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao
Submitted: 17 Feb 2026 (arXiv:2602.15724, Computer Science > Computer Vision and Pattern Recognition)
Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigabl...
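The step-level half, pruning irrelevant navigable candidates before they reach the LLM, can be illustrated with a simple sketch. Note the assumptions: the paper's retriever is learned by imitation, whereas here a word-overlap heuristic stands in for the learned scorer, and the candidate descriptions are hypothetical.

```python
def prune_candidates(instruction: str, candidates: list[dict], keep: int = 3) -> list[dict]:
    """Keep only the `keep` candidates whose descriptions best match the
    instruction. A word-overlap count stands in for the paper's
    imitation-learned scorer."""
    instr_words = set(instruction.lower().split())
    def score(c: dict) -> int:
        return len(instr_words & set(c["description"].lower().split()))
    return sorted(candidates, key=score, reverse=True)[:keep]

# Hypothetical navigable candidates observed at one step.
candidates = [
    {"id": "v1", "description": "a doorway leading into the kitchen"},
    {"id": "v2", "description": "a window above the couch"},
    {"id": "v3", "description": "stairs going down to the basement"},
    {"id": "v4", "description": "the hallway toward the kitchen door"},
]

pruned = prune_candidates("walk to the kitchen and stop at the door",
                          candidates, keep=2)
```

Only the pruned, shortened candidate list is then passed to the LLM at each step, which is where the efficiency gain comes from: the model reasons over fewer, more relevant options.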