[2510.02284] Learning to Generate Rigid Body Interactions with Video Diffusion Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2510.02284 (cs)

[Submitted on 2 Oct 2025 (v1), last revised 20 Mar 2026 (this version, v3)]

Title: Learning to Generate Rigid Body Interactions with Video Diffusion Models

Authors: David Romero, Ariana Bermudez, Viacheslav Iablochnikov, Hao Li, Fabio Pizzati, Ivan Laptev

Abstract: Recent video generation models have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, current approaches still struggle to generate physically plausible object interactions and lack object-level control mechanisms. To address these limitations, we introduce KineMask, an approach for video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy, we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements and generalization to r...
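The abstract describes the two-stage strategy only at a high level: future motion supervision is provided through object masks and then gradually withdrawn. As one possible reading, the sketch below schedules away future-frame mask conditioning during a second training stage, so the model eventually relies only on the first-frame mask and the specified velocity. This is a minimal, hypothetical illustration (PyTorch assumed); the function name `mask_condition` and the linear schedule are our own choices, not details from the paper.

```python
import torch


def mask_condition(object_masks: torch.Tensor, step: int,
                   total_steps: int, stage: int) -> torch.Tensor:
    """Zero out object masks for future frames according to the training stage.

    object_masks: (B, T, 1, H, W) binary masks of the controlled object.
    Stage 1 keeps masks for all T frames (full future supervision);
    stage 2 linearly shrinks the conditioned window until only the
    first-frame mask remains (a hypothetical schedule, not the paper's).
    """
    T = object_masks.shape[1]
    if stage == 1:
        keep = T
    else:
        frac = 1.0 - step / total_steps       # decays 1 -> 0 over the stage
        keep = max(1, round(frac * T))        # always keep frame 0
    cond = object_masks.clone()
    cond[:, keep:] = 0.0                      # drop supervision for future frames
    return cond


# Toy usage: 8-frame clips with 64x64 masks.
masks = (torch.rand(2, 8, 1, 64, 64) > 0.5).float()
early = mask_condition(masks, step=0, total_steps=1000, stage=2)   # keeps all 8 frames
late = mask_condition(masks, step=900, total_steps=1000, stage=2)  # keeps only frame 0
print(early[:, 7].sum().item() > 0, late[:, 7].sum().item() == 0)  # True True
```

The conditioning tensor produced here would be fed to the video diffusion model alongside the input image and velocity signal; how KineMask actually injects these conditions is not specified in the abstract.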