[2602.13977] WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL
Summary
The paper presents WoVR, a world-model-based reinforcement learning framework for post-training Vision-Language-Action (VLA) policies. Rather than assuming the learned world model is a faithful simulator, WoVR explicitly regulates how RL interacts with its imperfect imagined dynamics, mitigating hallucination and long-horizon error accumulation in imagined rollouts.
Why It Matters
As reinforcement learning (RL) continues to evolve, the ability to effectively simulate environments is crucial for training robust AI systems. WoVR's approach to managing inaccuracies in world models could significantly improve the deployment of RL in real-world robotic applications, enhancing both stability and performance.
Key Takeaways
- WoVR regulates RL interactions with imperfect world models to improve stability.
- Keyframe-Initialized Rollouts help reduce effective error depth in simulations.
- The framework demonstrates a significant increase in success rates for robotic manipulation tasks.
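The keyframe-initialized rollout idea above can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the toy scalar dynamics, and the stand-in policy are all hypothetical. The point it demonstrates is the general one: if each imagined rollout is re-anchored at a real observed keyframe, world-model error can only accumulate over one short segment rather than the full episode horizon.

```python
import random

def noisy_world_model(state, action, drift=0.05):
    # Stand-in for a learned world model: true dynamics plus model error.
    return state + action + random.uniform(-drift, drift)

def rollout(start_state, policy, world_model, horizon):
    # One imagined segment: roll the world model forward for `horizon` steps.
    state, states = start_state, []
    for _ in range(horizon):
        state = world_model(state, policy(state))
        states.append(state)
    return states

def keyframe_initialized_rollouts(keyframes, policy, world_model, segment_len):
    # One short imagined segment per real keyframe: the effective error
    # depth is segment_len, not len(keyframes) * segment_len.
    return [rollout(kf, policy, world_model, segment_len) for kf in keyframes]

policy = lambda s: 0.1            # trivial stand-in policy
keyframes = [0.0, 1.0, 2.0, 3.0]  # real states logged from the environment
segments = keyframe_initialized_rollouts(
    keyframes, policy, noisy_world_model, segment_len=5
)
print(len(segments), len(segments[0]))  # 4 segments of 5 imagined steps each
```

The trade-off this sketches: longer segments give the policy more imagined experience per anchor, but also more steps over which model error can compound before the next real keyframe resets it.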
Computer Science > Robotics
arXiv:2602.13977 (cs) [Submitted on 15 Feb 2026]
Title: WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL
Authors: Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, Dongbin Zhao
Abstract: Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce...
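The abstract's core claim is that policies exploit model inaccuracies unless interaction with the imperfect model is regulated. One generic way to do this (a hedged illustration, not WoVR's actual mechanism) is to truncate an imagined rollout once a simple uncertainty estimate, here the disagreement of a toy two-member model ensemble, exceeds a threshold, so the policy never trains on steps the model is likely hallucinating. All names and dynamics below are invented for the sketch.

```python
def ensemble_step(models, state, action):
    # Predict the next state with each ensemble member; use the spread
    # (variance) of predictions as a crude model-uncertainty estimate.
    preds = [m(state, action) for m in models]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var

def truncated_rollout(models, policy, start_state, horizon, max_var):
    # Roll out in imagination, but stop as soon as the ensemble disagrees
    # too much: later steps would feed corrupted signal to the optimizer.
    state, steps = start_state, []
    for _ in range(horizon):
        state, var = ensemble_step(models, state, policy(state))
        if var > max_var:
            break  # model is no longer trustworthy at this depth
        steps.append(state)
    return steps

# Two toy "ensemble members" whose predictions diverge as the state grows,
# mimicking error accumulation with rollout depth.
models = [lambda s, a: s + a, lambda s, a: (s + a) * 1.1]
policy = lambda s: 1.0
steps = truncated_rollout(models, policy, 0.0, horizon=20, max_var=0.05)
print(len(steps))  # rollout stops well short of the 20-step horizon
```

The design choice this illustrates is the same one the abstract motivates: bound how deep into imagination the policy is allowed to learn, rather than trusting the full closed-loop rollout.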