[2602.15513] Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling
Summary
This paper presents a novel non-parametric memory framework for improving Multimodal Large Language Models (MLLMs) in embodied exploration and question answering, enhancing performance through human-inspired memory modeling.
Why It Matters
The research addresses significant challenges in deploying MLLMs for embodied agents, particularly in dynamic environments. By improving memory modeling, it enhances the efficiency and reasoning capabilities of AI systems, which is crucial for advancing robotics and AI applications.
Key Takeaways
- Introduces a non-parametric memory framework that separates episodic and semantic memory.
- Demonstrates state-of-the-art performance improvements in embodied question answering benchmarks.
- Highlights the importance of episodic memory for exploration efficiency and semantic memory for complex reasoning.
- Offers a retrieval-first, reasoning-assisted approach that enhances memory reuse.
- Provides insights into cross-environment generalization capabilities of embodied agents.
Computer Science > Robotics
arXiv:2602.15513 (cs) [Submitted on 17 Feb 2026]
Title: Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling
Authors: Ji Li, Jing Xia, Mingyi Li, Shiyan Hu
Abstract: Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 1...
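The retrieval-first, reasoning-assisted recall described in the abstract can be pictured as a two-stage pipeline: rank stored episodes by semantic similarity to the query, then let a reasoning step verify the candidates before reuse. The sketch below illustrates that shape only; the class name, the embedding space, and the `verify` callback (standing in for the paper's MLLM visual-reasoning check) are all illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class EpisodicMemory:
    """Toy non-parametric episodic store: (embedding, observation) pairs."""

    def __init__(self):
        self.entries = []

    def store(self, embedding, observation):
        self.entries.append((embedding, observation))

    def recall(self, query_embedding, k=3, verify=None):
        # Retrieval first: rank stored episodes by semantic similarity.
        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[0], query_embedding),
                        reverse=True)[:k]
        # Reasoning-assisted: an optional verifier (in the paper, a visual
        # reasoning check by the MLLM) filters candidates before reuse.
        if verify is not None:
            ranked = [e for e in ranked if verify(e[1])]
        return [obs for _, obs in ranked]

mem = EpisodicMemory()
mem.store([1.0, 0.0], "red door at hallway end")
mem.store([0.0, 1.0], "kitchen with white table")
print(mem.recall([0.9, 0.1], k=1))  # -> ['red door at hallway end']
```

Because recall is similarity-based rather than geometry-based, past observations can be reused even when the agent's pose or map alignment has drifted, which is the robustness the abstract attributes to avoiding rigid geometric alignment.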