[2602.15922] World Action Models are Zero-shot Policies
Summary
The paper introduces DreamZero, a World Action Model (WAM) that improves zero-shot policy learning for robotic tasks by jointly predicting future world states and actions from video data, achieving substantial gains in generalization and performance.
Why It Matters
This research addresses a key limitation of current Vision-Language-Action (VLA) models: poor generalization to unseen physical motions in new environments. By leveraging video data for action prediction, DreamZero enables more efficient learning and adaptability across diverse environments, which is crucial for real-world robotic deployment.
Key Takeaways
- DreamZero achieves over 2x improvement in generalization to new tasks compared to existing models.
- The model enables real-time closed-loop control at 7Hz using a 14B autoregressive video diffusion model.
- Cross-embodiment transfer allows for significant performance gains with minimal training data.
- DreamZero supports few-shot embodiment adaptation, retaining zero-shot generalization capabilities.
- The approach highlights the potential of video data in enhancing robotic learning and adaptability.
Computer Science > Robotics
arXiv:2602.15922 (cs)
[Submitted on 17 Feb 2026]
Title: World Action Models are Zero-shot Policies
Authors: Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi Wang, Ryan Julian, Danfei Xu, Yilun Du, Yevgen Chebotar, Scott Reed, Jan Kautz, Yuke Zhu, Linxi "Jim" Fan, Joel Jang
Abstract: State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through...