[2511.18746] Any4D: Open-Prompt 4D Generation from Natural Language and Images
Computer Science > Computer Vision and Pattern Recognition
arXiv:2511.18746 (cs)

This paper has been withdrawn by Qiao Sun.
[Submitted on 24 Nov 2025 (v1), last revised 27 Mar 2026 (this version, v2)]

Title: Any4D: Open-Prompt 4D Generation from Natural Language and Images
Authors: Hao Li, Qiao Sun

Abstract: While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions, and they exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a \textit{"GPT moment"} in the embodied domain. We start from a simple observation: \textit{the diversity of embodied data far exceeds the relatively small space of possible primitive motions}. Based on this insight, we propose \textbf{Primitive Embodied World Models} (PEWM), which restrict video generation to fixed shorter horizons. Our approach \textit{1) enables} fine-grained alignment between linguistic concepts and visual representations of robotic actions, \textit{2) reduces} learning complexity, \textit{3) improves} data efficiency in embodied data collection, and \textit{4) decr...
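
The abstract is truncated above, but its central mechanism, restricting generation to a fixed short horizon per motion primitive and composing those clips into a long-horizon rollout, lends itself to a brief sketch. The Python below is a minimal illustration under assumed details, not the authors' implementation: PrimitiveWorldModel, HORIZON, PRIMITIVES, generate_clip, and rollout are all hypothetical names, and random frames stand in for a real video generator.

```python
# Minimal sketch of the fixed-horizon idea from the abstract.
# All names here (PrimitiveWorldModel, HORIZON, PRIMITIVES, generate_clip,
# rollout) are hypothetical illustrations, not the authors' API.
import numpy as np

HORIZON = 16                 # fixed, short clip length per primitive (assumed)
FRAME_SHAPE = (64, 64, 3)    # toy frame resolution (assumed)

# A small primitive-motion vocabulary, per the observation that the space of
# primitive motions is far smaller than the diversity of embodied data.
PRIMITIVES = ["reach", "grasp", "lift", "place"]

class PrimitiveWorldModel:
    """Stand-in for a video generator conditioned on one motion primitive.

    A real model would map (current frame, primitive text) to a short clip;
    here we emit lightly perturbed frames just to make the control flow concrete.
    """

    def generate_clip(self, start_frame: np.ndarray, primitive: str) -> np.ndarray:
        assert primitive in PRIMITIVES, f"unknown primitive: {primitive}"
        rng = np.random.default_rng(abs(hash(primitive)) % (2**32))
        noise = rng.normal(0.0, 0.05, size=(HORIZON, *FRAME_SHAPE))
        return np.clip(start_frame[None] + noise, 0.0, 1.0)

def rollout(model: PrimitiveWorldModel,
            start_frame: np.ndarray,
            plan: list[str]) -> np.ndarray:
    """Compose a long-horizon video by chaining fixed-horizon primitive clips.

    Each clip is conditioned on the last frame of the previous one, so the
    generator never has to produce more than HORIZON frames at once.
    """
    clips = []
    frame = start_frame
    for primitive in plan:
        clip = model.generate_clip(frame, primitive)
        clips.append(clip)
        frame = clip[-1]  # hand the final frame off to the next primitive
    return np.concatenate(clips, axis=0)

if __name__ == "__main__":
    model = PrimitiveWorldModel()
    first = np.full(FRAME_SHAPE, 0.5)
    video = rollout(model, first, ["reach", "grasp", "lift", "place"])
    print(video.shape)  # (64, 64, 64, 3): 4 primitives x 16 frames each
```

Chaining clips this way keeps each generation call at the fixed short horizon, which is the source of the reduced learning complexity and improved data efficiency the abstract claims.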