[2503.04641] Simulating the Real World: A Unified Survey of Multimodal Generative Models
Summary
This article surveys multimodal generative models, tracing their progression from 2D to 4D representations, with the goal of improving real-world simulation in AI research.
Why It Matters
Understanding and simulating the real world is a critical challenge for Artificial General Intelligence (AGI) research. Existing methods often treat 2D images, videos, 3D, and 4D representations as independent domains and overlook their interdependencies; this survey instead reviews them within a single unified framework organized by data dimensionality.
Key Takeaways
- The survey unifies 2D, video, 3D, and 4D generative models into a single framework.
- It highlights the interdependencies between different modalities for better simulation accuracy.
- Comprehensive reviews of datasets and evaluation metrics are provided to guide future research.
- The work serves as a foundational resource for newcomers in the field.
- It emphasizes the importance of integrating various dimensions of reality in AI models.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2503.04641 (cs)
[Submitted on 6 Mar 2025 (v1), last revised 16 Feb 2026 (this version, v3)]
Title: Simulating the Real World: A Unified Survey of Multimodal Generative Models
Authors: Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
Abstract: Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey for multimodal generative models that investigate the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates...