[2512.02425] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Computer Science > Computer Vision and Pattern Recognition

arXiv:2512.02425 (cs)

[Submitted on 2 Dec 2025 (v1), last revised 27 Mar 2026 (this version, v2)]

Title: WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Authors: Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang

Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed […]
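The abstract names the three complementary memories but does not give implementation details. As a rough illustrative sketch only, the following Python code shows one plausible shape for such a multimodal memory store: episodic entries indexed at multiple temporal scales, a continuously overwritten semantic store, and a visual store of frame-level evidence. All class, field, and method names here (WorldMMMemory, EpisodicEntry, retrieve_episodic, etc.) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicEntry:
    """A factual event indexed at one of several temporal scales."""
    start_s: float   # segment start time (seconds)
    end_s: float     # segment end time (seconds)
    scale: str       # hypothetical scale label, e.g. "clip", "scene", "hour"
    summary: str     # textual description of the event

@dataclass
class VisualEntry:
    """Raw visual evidence kept alongside textual summaries."""
    timestamp_s: float
    frame_embedding: list  # placeholder for a keyframe feature vector

@dataclass
class WorldMMMemory:
    """Hypothetical container for the three complementary memories."""
    episodic: list = field(default_factory=list)  # EpisodicEntry items
    semantic: dict = field(default_factory=dict)  # concept -> latest knowledge
    visual: list = field(default_factory=list)    # VisualEntry items

    def update_semantic(self, concept: str, knowledge: str) -> None:
        # Semantic memory is continuously updated in place with the newest
        # high-level knowledge, rather than appended event-by-event.
        self.semantic[concept] = knowledge

    def retrieve_episodic(self, t0: float, t1: float) -> list:
        # Return events from any temporal scale that overlap [t0, t1], so a
        # query is not tied to one fixed segment length.
        return [e for e in self.episodic if e.start_s < t1 and e.end_s > t0]
```

Because episodic entries carry their own scale label and time span, a retrieval agent built on this structure could mix coarse and fine segments in one query result, which is one simple way to realize the variable-duration retrieval the abstract motivates.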