[2512.02425] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Computer Science > Computer Vision and Pattern Recognition

arXiv:2512.02425 (cs)

[Submitted on 2 Dec 2025 (v1), last revised 27 Mar 2026 (this version, v2)]

Title: WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Authors: Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed […]
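The abstract names the three complementary memories but does not give implementation details. As a rough illustrative sketch only, the following Python code shows one plausible shape for such a multimodal memory store: episodic entries indexed at multiple temporal scales, a continuously overwritten semantic store, and a visual store of frame-level evidence. All class, field, and method names here (WorldMMMemory, EpisodicEntry, retrieve_episodic, etc.) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicEntry:
    """A factual event indexed at one of several temporal scales."""
    start_s: float   # segment start time (seconds)
    end_s: float     # segment end time (seconds)
    scale: str       # hypothetical scale label, e.g. "clip", "scene", "hour"
    summary: str     # textual description of the event

@dataclass
class VisualEntry:
    """Raw visual evidence kept alongside textual summaries."""
    timestamp_s: float
    frame_embedding: list  # placeholder for a keyframe feature vector

@dataclass
class WorldMMMemory:
    """Hypothetical container for the three complementary memories."""
    episodic: list = field(default_factory=list)  # EpisodicEntry items
    semantic: dict = field(default_factory=dict)  # concept -> latest knowledge
    visual: list = field(default_factory=list)    # VisualEntry items

    def update_semantic(self, concept: str, knowledge: str) -> None:
        # Semantic memory is continuously updated in place with the newest
        # high-level knowledge, rather than appended event-by-event.
        self.semantic[concept] = knowledge

    def retrieve_episodic(self, t0: float, t1: float) -> list:
        # Return events from any temporal scale that overlap [t0, t1], so a
        # query is not tied to one fixed segment length.
        return [e for e in self.episodic if e.start_s < t1 and e.end_s > t0]
```

Because episodic entries carry their own scale label and time span, a retrieval agent built on this structure could mix coarse and fine segments in one query result, which is one simple way to realize the variable-duration retrieval the abstract motivates.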