[2512.02425] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

[2512.02425] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

arXiv - AI 4 min read

About this article

Abstract page for arXiv paper 2512.02425: WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Computer Science > Computer Vision and Pattern Recognition arXiv:2512.02425 (cs) [Submitted on 2 Dec 2025 (v1), last revised 27 Mar 2026 (this version, v2)] Title:WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning Authors:Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang View a PDF of the paper titled WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning, by Woongyeong Yeo and 3 other authors View PDF HTML (experimental) Abstract:Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detaile...

Originally published on March 30, 2026. Curated by AI News.

Related Articles

De-aged casts, ChatGPT-generated programs: How AI is changing Korean TV
Llms

De-aged casts, ChatGPT-generated programs: How AI is changing Korean TV

Artificial intelligence is transforming every corner of industry, and television is no exception. Major networks in Korea have recently a...

AI Tools & Products · 4 min ·
[2603.16629] MLLM-based Textual Explanations for Face Comparison
Llms

[2603.16629] MLLM-based Textual Explanations for Face Comparison

Abstract page for arXiv paper 2603.16629: MLLM-based Textual Explanations for Face Comparison

arXiv - AI · 4 min ·
[2603.15159] To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation
Llms

[2603.15159] To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

Abstract page for arXiv paper 2603.15159: To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

arXiv - AI · 4 min ·
[2602.08316] SWE Context Bench: A Benchmark for Context Learning in Coding
Llms

[2602.08316] SWE Context Bench: A Benchmark for Context Learning in Coding

Abstract page for arXiv paper 2602.08316: SWE Context Bench: A Benchmark for Context Learning in Coding

arXiv - AI · 4 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime