[2509.15130] Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Computer Science > Graphics

arXiv:2509.15130 (cs)

[Submitted on 18 Sep 2025 (v1), last revised 21 Mar 2026 (this version, v3)]

Title: Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control

Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang

Abstract: Video diffusion models carry rich world priors, but their use in spatial tasks is limited by poor controllability, spatially and temporally inconsistent results, and entangled scene and camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping, often introduce visual artifacts, fail to generalize, or incur high computational costs. We introduce WorldForge, a novel, training-free framework that operates purely at inference time to resolve these issues. Our method comprises three synergistic components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output to ensure strict adherence to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space, allowing the framework to apply guidance selectively, thereby decoupling motion from appearance and preserving visual fidelity. Third, a dual-path guidance strategy adaptively corrects for drift by comparing the guided ...
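To make the abstract's three components concrete, below is a minimal PyTorch sketch of how such an inference-time pipeline could be wired together: an intra-step refinement loop, flow-based selection of motion-related latent channels, and a dual-path (guided vs. unguided) update. Every name here (denoise_step, warp_to_camera_path, flow_channel_mask) and the channel-masking heuristic are illustrative assumptions, not the paper's actual implementation.

    # Hypothetical sketch only; stand-ins for the components described in
    # the abstract, not the authors' code.
    import torch

    def denoise_step(model, latents, t):
        # Placeholder for one sampler step of a pretrained video diffusion model.
        return latents - 0.01 * model(latents, t)

    def warp_to_camera_path(latents, t):
        # Placeholder: the real method would warp latents to follow the target
        # camera trajectory; a spatial shift stands in for that here.
        return torch.roll(latents, shifts=1, dims=-1)

    def flow_channel_mask(latents_t, latents_prev, top_k=8):
        # Stand-in for the optical-flow-based analysis: rank latent channels by
        # how much they change across steps, keep the most motion-correlated ones.
        delta = (latents_t - latents_prev).abs().mean(dim=(0, 2, 3))
        mask = torch.zeros_like(delta)
        mask[delta.topk(top_k).indices] = 1.0
        return mask.view(1, -1, 1, 1)

    def guided_denoising(latents, model, num_steps=50, refine_iters=3, scale=0.5):
        prev = latents.clone()
        for t in range(num_steps):
            # Dual path: keep an unguided trajectory as a drift reference.
            free = denoise_step(model, latents, t)
            guided = free.clone()
            mask = flow_channel_mask(guided, prev)
            # Intra-step refinement: nudge the partially denoised latents toward
            # the target camera path several times before committing the step.
            for _ in range(refine_iters):
                target = warp_to_camera_path(guided, t)
                guided = guided + scale * mask * (target - guided)
            # Guide only the motion channels; appearance channels follow the
            # unguided path, decoupling motion from appearance.
            latents = mask * guided + (1 - mask) * free
            prev = latents.clone()
        return latents

    # Toy usage with a random stand-in for the pretrained video model.
    model = lambda x, t: torch.tanh(x)
    out = guided_denoising(torch.randn(1, 16, 32, 32), model, num_steps=5)

The point of the sketch is the control flow: guidance is injected repeatedly within each denoising step rather than once per step, and the per-channel mask confines that guidance to the latent dimensions that encode motion.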