[2604.03552] CRAFT: Video Diffusion for Bimanual Robot Data Generation
Computer Science > Robotics

arXiv:2604.03552 (cs)

[Submitted on 4 Apr 2026]

Title: CRAFT: Video Diffusion for Bimanual Robot Data Generation

Authors: Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita

Abstract: Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic t...
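The abstract's key mechanism is conditioning the video diffusion model on per-frame edge maps extracted from simulated trajectories. The paper names Canny edges specifically; as a minimal sketch of the conditioning signal, the snippet below uses a simplified gradient-magnitude edge detector as a stand-in for Canny, applied frame by frame to a toy simulated clip. The function names, thresholds, and array shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def edge_map(frame: np.ndarray, thresh: float = 0.2) -> np.ndarray:
    """Simplified edge detector (gradient-magnitude threshold) standing in for
    the Canny edges CRAFT conditions on. `frame` is (H, W) grayscale in [0, 1]."""
    gy, gx = np.gradient(frame.astype(np.float64))
    mag = np.hypot(gx, gy)                     # per-pixel gradient magnitude
    return (mag > thresh).astype(np.uint8)     # binary edge mask

def edge_condition_video(frames: np.ndarray) -> np.ndarray:
    """Per-frame edge maps for a (T, H, W) simulated clip. In CRAFT these maps
    serve as the structural conditioning fed to the video diffusion model."""
    return np.stack([edge_map(f) for f in frames])

# Toy "simulated trajectory": a bright square translating across a dark scene.
T, H, W = 4, 32, 32
clip = np.zeros((T, H, W))
for t in range(T):
    clip[t, 10:20, 5 + 4 * t : 15 + 4 * t] = 1.0

cond = edge_condition_video(clip)  # (4, 32, 32) stack of edge masks
```

The edge stack preserves the geometry and motion of the simulated rollout while discarding texture and lighting, which is what lets the diffusion model repaint appearance (backgrounds, lighting, even embodiment) without breaking the action-consistent trajectory.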