[2603.02626] See and Remember: A Multimodal Agent for Web Traversal
Computer Science > Artificial Intelligence
arXiv:2603.02626 (cs)
[Submitted on 3 Mar 2026]

Title: See and Remember: A Multimodal Agent for Web Traversal
Authors: Xinjun Wang, Shengyao Wang, Aimin Zhou, Hao Hao

Abstract: Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM)-based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose V-GEMS (Visual Grounding and Explicit Memory System), a generally applicable, robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling reliable backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show that V-GEMS significantly outperforms the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at this https URL.

Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.02626 [cs.AI]
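The explicit memory stack with state tracking can be sketched minimally as follows. This is an illustrative assumption about how such a mechanism might work, not the authors' released code: the class name `MemoryStack`, the `(state_id, action)` path representation, and the visited-set cycle check are all hypothetical.

```python
# Hypothetical sketch of an explicit memory stack with state tracking.
# It records the traversal path so the agent can backtrack to earlier
# pages and refuse actions that would revisit a state (loop prevention).

class MemoryStack:
    """Structured map of the agent's traversal path."""

    def __init__(self):
        self.path = []        # stack of (state_id, action) steps taken
        self.visited = set()  # every state id seen so far

    def push(self, state_id, action):
        """Record a step; reject moves that would revisit a state."""
        if state_id in self.visited:
            return False  # cycle detected: caller should backtrack
        self.visited.add(state_id)
        self.path.append((state_id, action))
        return True

    def backtrack(self):
        """Pop the most recent step; return the state id to revisit,
        or None if the path is empty."""
        if not self.path:
            return None
        state_id, _ = self.path.pop()
        return state_id


mem = MemoryStack()
mem.push("home", "start")
mem.push("catalog", "click: Products")
mem.push("home", "click: Logo")  # returns False: loop prevented
mem.backtrack()                  # returns "catalog"
```

Keeping the path as an explicit stack (rather than implicit chat history) is what makes backtracking well-defined: the agent always knows the exact state to return to when a branch dead-ends.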