[2603.26859] Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation

Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.26859 (cs)
[Submitted on 27 Mar 2026]

Title: Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation

Authors: Dongsheng Yang, Yinfeng Yu, Liejun Wang

Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the ...
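
The knowledge-base construction described in the abstract can be pictured as a simple data flow: instructions go to a phrase extractor, each goal phrase goes to an image generator, and panoramic views go to a captioner. The sketch below illustrates only that flow and is not the authors' implementation. Every name in it (`GoalImageKB`, `build_goal_image_kb`, `build_text_kb`, and the `toy_*` stand-ins) is hypothetical. The real models (Qwen3-4B, Flux-Schnell, BLIP-2) are replaced by plain Python callables, and choices such as generating one image per distinct phrase are assumptions that the abstract does not confirm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GoalImageKB:
    """Hypothetical container: goal phrase -> identifiers of generated images."""
    entries: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, phrase: str, image_id: str) -> None:
        self.entries.setdefault(phrase, []).append(image_id)


def build_goal_image_kb(
    instructions: List[str],
    extract_phrases: Callable[[str], List[str]],  # stand-in for an LLM call
    generate_image: Callable[[str], str],         # stand-in for a text-to-image call
) -> GoalImageKB:
    """Extract goal phrases from each instruction and generate one image per phrase."""
    kb = GoalImageKB()
    for instruction in instructions:
        for phrase in extract_phrases(instruction):
            # Assumption: a phrase seen before reuses its existing image.
            if phrase not in kb.entries:
                kb.add(phrase, generate_image(phrase))
    return kb


def build_text_kb(
    views: Dict[str, str],
    caption: Callable[[str], str],  # stand-in for an image-captioning call
) -> Dict[str, str]:
    """Caption every panoramic view, keyed by a view identifier."""
    return {view_id: caption(view) for view_id, view in views.items()}


# Toy stand-ins so the sketch runs without any model downloads.
def toy_extract(instruction: str) -> List[str]:
    vocab = {"kitchen", "sofa", "stairs"}
    words = [w.strip(".,").lower() for w in instruction.split()]
    return [w for w in words if w in vocab]


def toy_generate(phrase: str) -> str:
    return f"img::{phrase}"


def toy_caption(view: str) -> str:
    return f"a view of {view}"


if __name__ == "__main__":
    kb = build_goal_image_kb(
        ["Walk past the sofa and stop in the kitchen.",
         "Go up the stairs to the kitchen."],
        toy_extract,
        toy_generate,
    )
    print(kb.entries)
    print(build_text_kb({"pano_0": "hallway"}, toy_caption))
```

Passing the models in as callables keeps the sketch independent of any particular inference library. How BTK stores, retrieves, and fuses these entries through its Goal-Aware Augmentor and Knowledge Augmentor is not specified in the abstract and is left out here.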

Originally published on March 31, 2026. Curated by AI News.
