[2602.16229] Factored Latent Action World Models
Summary
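The paper presents the Factored Latent Action Model (FLAM), a framework for learning controllable world models from action-free video. Instead of learning a single latent action for the whole scene, FLAM decomposes the scene into independent factors, each of which infers its own latent action and predicts its own next-step value. The authors report better prediction accuracy and representation quality than prior work.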
Why It Matters
Most existing latent action models rely on monolithic inverse and forward dynamics models that learn one latent action to control the entire scene, so they struggle when multiple entities act simultaneously. FLAM's factored structure targets this case, and the abstract reports that it also facilitates downstream policy learning.
Key Takeaways
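- FLAM decomposes a scene into independent factors, each with its own latent action and its own next-step prediction.
- The factored structure models multi-entity dynamics more accurately than a single scene-level latent action.
- FLAM improves video generation quality over monolithic models in action-free video settings.
- FLAM facilitates downstream policy learning.
- Experiments on both simulated and real-world multi-entity datasets show FLAM outperforming prior work in prediction accuracy and representation quality.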
Computer Science > Machine Learning
arXiv:2602.16229 (cs)
[Submitted on 18 Feb 2026]
Title: Factored Latent Action World Models
Authors: Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone
Abstract: Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the ben...
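
The factored structure described in the abstract (one latent action and one next-step prediction per factor) can be illustrated with a toy sketch. This is not the authors' implementation: the linear maps stand in for learned networks, and the sizes `K`, `D`, `A` and the function names `infer_latent_actions` and `predict_next_factors` are hypothetical choices made only for illustration. It assumes the scene has already been encoded into `K` factor vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: number of factors, factor dimension, latent action dimension.
K, D, A = 3, 8, 2

# Per-factor linear maps standing in for learned networks (an assumption).
W_inv = rng.normal(size=(K, 2 * D, A)) * 0.1
W_fwd = rng.normal(size=(K, D + A, D)) * 0.1


def infer_latent_actions(z_t, z_next):
    """Inverse dynamics: one latent action per factor from (z_t[k], z_next[k])."""
    pair = np.concatenate([z_t, z_next], axis=-1)          # (K, 2D)
    return np.tanh(np.einsum("ki,kia->ka", pair, W_inv))   # (K, A)


def predict_next_factors(z_t, actions):
    """Forward dynamics: each factor predicts its own next value from (z_t[k], a[k])."""
    inp = np.concatenate([z_t, actions], axis=-1)          # (K, D + A)
    return z_t + np.einsum("ki,kid->kd", inp, W_fwd)       # (K, D)


# Random factor values for two consecutive frames (placeholder data).
z_t = rng.normal(size=(K, D))
z_next = rng.normal(size=(K, D))

actions = infer_latent_actions(z_t, z_next)
pred = predict_next_factors(z_t, actions)

# Edit only factor 1's latent action and compare predictions per factor.
edited = actions.copy()
edited[1] += 1.0
pred_edited = predict_next_factors(z_t, edited)
changed = np.any(~np.isclose(pred, pred_edited), axis=-1)
print(changed)
```

In this sketch, editing one factor's latent action alters only that factor's predicted next value, which is the kind of per-entity control a single scene-level latent action cannot express directly. A monolithic counterpart would flatten all factors into one vector and infer a single action for it.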