[2507.08831] View Invariant Learning for Vision-Language Navigation in Continuous Environments
Summary
This paper introduces View Invariant Learning (VIL), a post-training strategy for Vision-Language Navigation in Continuous Environments (VLNCE) that makes navigation policies robust to viewpoint changes, i.e., variations in camera height and viewing angle.
Why It Matters
The research tackles a critical challenge in embodied AI: navigation policies that break when the camera viewpoint changes. Improving robustness to varying camera height and viewing angle makes AI agents more reliable in real-world deployments, where camera placement often differs from training conditions.
Key Takeaways
- VIL improves navigation policies' robustness to viewpoint changes.
- The proposed method outperforms state-of-the-art approaches by 8–15%.
- VIL serves as a plug-and-play post-training method without diminishing standard performance.
- The approach utilizes a teacher-student framework for knowledge distillation.
- Empirical results validate the effectiveness of VIL on benchmark datasets.
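The paper does not include code, but the teacher-student idea in the takeaways can be illustrated with a small sketch: a view-dependent teacher's waypoint distribution supervises a view-invariant student via a temperature-scaled KL divergence. The function name, heatmap shape, and temperature value here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def waypoint_distill_loss(teacher_logits, student_logits, tau=2.0):
    """Illustrative distillation loss: KL(teacher || student) over
    waypoint heatmap bins, with temperature tau (hypothetical setup)."""
    p_t = softmax(teacher_logits / tau)
    log_p_t = np.log(p_t + 1e-12)
    log_p_s = np.log(softmax(student_logits / tau) + 1e-12)
    # Standard tau**2 scaling keeps gradient magnitudes comparable
    # across temperatures (as in common distillation recipes).
    return (p_t * (log_p_t - log_p_s)).sum(axis=-1).mean() * tau**2
```

The loss is zero when the student matches the teacher exactly and positive otherwise, so minimizing it pulls the student's waypoint predictions (from a perturbed viewpoint) toward the teacher's (from the original viewpoint).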
Computer Science > Computer Vision and Pattern Recognition
arXiv:2507.08831 (cs)
[Submitted on 5 Jul 2025 (v1), last revised 18 Feb 2026 (this version, v3)]
Title: View Invariant Learning for Vision-Language Navigation in Continuous Environments
Authors: Josh Qixuan Sun, Xiaoying Xing, Huaiyuan Weng, Chul Min Yeum, Mark Crowley
Abstract: Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent's observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly o...
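The abstract mentions a contrastive learning framework for view-invariant features without giving details. A common way to realize this idea is an InfoNCE-style objective that pairs features of the same scene rendered from two viewpoints; the sketch below assumes that formulation, and the function name, batch layout, and temperature are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce_view_pairs(z_a, z_b, temperature=0.1):
    """Illustrative InfoNCE loss: row i of z_a and row i of z_b are
    features of the same scene under two camera viewpoints (positives);
    all other rows in the batch act as negatives."""
    # L2-normalize so similarities are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N); positives on the diagonal
    # Row-wise log-softmax, computed stably.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the matching view for each scene.
    return -np.diag(log_probs).mean()
```

Minimizing this loss pushes features of the same scene together across viewpoints and apart from other scenes, which is one way to obtain the view-invariant representation the abstract describes.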