[2602.16229] Factored Latent Action World Models

arXiv - Machine Learning

Summary

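The paper presents the Factored Latent Action Model (FLAM), a framework for learning latent actions from action-free video. Instead of learning a single latent action for the whole scene, FLAM decomposes the scene into independent factors, each with its own latent action and its own next-step prediction. The authors report better prediction accuracy and representation quality than prior work.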

Why It Matters

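Most existing latent action models rely on monolithic inverse and forward dynamics models, so they struggle in environments where multiple entities act simultaneously. FLAM targets this limitation with a factored structure, which the authors report improves video generation quality and facilitates downstream policy learning. That is relevant to areas such as robotics, where modeling interactions among several entities is essential.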

Key Takeaways

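  • FLAM decomposes a scene into independent factors, each inferring its own latent action and predicting its own next-step factor value.
  • The factored structure models complex multi-entity dynamics more accurately than a single scene-level latent action.
  • FLAM improves video generation quality in action-free video settings compared to monolithic models.
  • It facilitates downstream policy learning.
  • In experiments on simulated and real-world multi-entity datasets, FLAM outperforms prior work in prediction accuracy and representation quality.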

Computer Science > Machine Learning
arXiv:2602.16229 (cs)
[Submitted on 18 Feb 2026]

Title: Factored Latent Action World Models

Authors: Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone

Abstract: Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the ben...
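The difference between a single scene-level latent action and per-factor latent actions can be illustrated with a toy sketch. This is not the paper's architecture: the 1-D "entity positions", the three-entry displacement codebook, and every function name below are hypothetical, chosen only to show why one shared latent action cannot describe entities that move independently, while one latent action per factor can.

```python
# Toy illustration only. All names and the setup are assumptions for this
# sketch; they are not taken from the FLAM paper or any library.

# Hypothetical setup: each factor is one entity's 1-D position, and a latent
# action is an index into a tiny codebook of displacements.
CODEBOOK = (-1, 0, 1)


def infer_latent_action(z_t, z_next):
    """Toy inverse dynamics for one factor: pick the nearest codebook displacement."""
    delta = z_next - z_t
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - delta))


def predict_next(z_t, action):
    """Toy forward dynamics for one factor: apply the chosen displacement."""
    return z_t + CODEBOOK[action]


def factored_step(state, next_state):
    """Infer one latent action per factor, then predict each factor separately."""
    actions = [infer_latent_action(z, zn) for z, zn in zip(state, next_state)]
    prediction = [predict_next(z, a) for z, a in zip(state, actions)]
    return actions, prediction


def monolithic_step(state, next_state):
    """Caricature of a monolithic model: one code from the same codebook moves every entity."""
    def error(i):
        return sum(abs(z + CODEBOOK[i] - zn) for z, zn in zip(state, next_state))

    action = min(range(len(CODEBOOK)), key=error)
    prediction = [z + CODEBOOK[action] for z in state]
    return action, prediction


def joint_action_count(num_factors, codes_per_factor=len(CODEBOOK)):
    """Distinct scene-level behaviours expressible with one code per factor."""
    return codes_per_factor ** num_factors


state = [0, 5, 9]
next_state = [1, 5, 8]  # entity 0 moves right, entity 1 stays, entity 2 moves left

factored_actions, factored_pred = factored_step(state, next_state)
mono_action, mono_pred = monolithic_step(state, next_state)

print("factored:", factored_actions, factored_pred)
print("monolithic:", mono_action, mono_pred)
print("joint behaviours with 3 factors:", joint_action_count(3))
```

In this toy, the per-factor latents reproduce the next state exactly, while the single shared code has to compromise across entities. A monolithic model could in principle match the factored one with a larger codebook, but it would need one code per joint behaviour, and that number grows exponentially with the number of entities (3 codes per factor and 3 factors already give 27 combinations).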