[2604.16552] Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
Computer Science > Computer Vision and Pattern Recognition

arXiv:2604.16552 (cs)

[Submitted on 17 Apr 2026 (v1), last revised 29 Apr 2026 (this version, v2)]

Title: Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

Authors: Zhenggang Tang, Yuehao Wang, Yuchen Fan, Jun-Kun Chen, Yu-Ying Yeh, Kihyuk Sohn, Zhangyang Wang, Qixing Huang, Alexander Schwing, Rakesh Ranjan, Dilin Wang, Zhicheng Yan

Abstract: Recent text-to-scene generation approaches have greatly reduced the manual effort required to create 3D scenes. However, they focus on generating either a scene layout or individual objects; few generate both. The generated scene layouts are often simple, even with the help of LLMs, and the generated scene is often inconsistent with a text input that contains non-trivial descriptions of the shape, appearance, and spatial arrangement of objects. We present a new paradigm of sequential text-to-scene generation and propose a novel generative model for interactive scene creation. At its core is a 3D Autoregressive Diffusion model, 3D-ARD+, which unifies autoregressive generation over a multimodal token sequence with diffusion-based generation of next-object 3D latents. To generate the next object, the model uses one autoregressive step to generate the coarse-grained 3D latents in the...
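The abstract describes a hybrid loop: an autoregressive pass over the multimodal token sequence produced so far conditions a diffusion process that generates the next object's 3D latent. The following is a minimal sketch of that general pattern only, not the authors' implementation; the module sizes, the DDPM noise schedule, and the names ARBackbone, Denoiser, and sample_next_object are all assumptions for illustration.

```python
# A minimal sketch (not the paper's code) of autoregressive-step-then-diffusion:
# a causal transformer summarizes the token prefix, and a DDPM-style reverse
# process denoises the next object's latent under that condition.
import torch
import torch.nn as nn

LATENT_DIM, MODEL_DIM, N_STEPS = 64, 128, 50  # assumed sizes and step count

class ARBackbone(nn.Module):
    """Causal transformer over the multimodal token sequence (assumed form)."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(MODEL_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                      # tokens: (B, T, MODEL_DIM)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(tokens, mask=mask)
        return h[:, -1]                             # condition for next object

class Denoiser(nn.Module):
    """Predicts the noise in a noisy object latent, given timestep + condition."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + MODEL_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, LATENT_DIM))

    def forward(self, z_t, t, cond):
        t_emb = t.float().view(-1, 1) / N_STEPS     # crude timestep embedding
        return self.net(torch.cat([z_t, cond, t_emb], dim=-1))

@torch.no_grad()
def sample_next_object(backbone, denoiser, tokens):
    """One autoregressive step, then a DDPM reverse process for the latent."""
    cond = backbone(tokens)
    betas = torch.linspace(1e-4, 0.02, N_STEPS)     # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(tokens.size(0), LATENT_DIM)     # start from pure noise
    for t in reversed(range(N_STEPS)):
        eps = denoiser(z, torch.full((tokens.size(0),), t), cond)
        mean = (z - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        z = mean + betas[t].sqrt() * torch.randn_like(z) if t > 0 else mean
    return z                                        # next-object 3D latent

if __name__ == "__main__":
    backbone, denoiser = ARBackbone(), Denoiser()
    tokens = torch.randn(1, 5, MODEL_DIM)           # toy multimodal prefix
    print(sample_next_object(backbone, denoiser, tokens).shape)  # (1, 64)
```

The point the sketch mirrors is the division of labor stated in the abstract: each autoregressive step emits a single conditioning state over the token sequence, while the continuous detail of the next object's 3D latent is delegated to the diffusion reverse process.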