[2603.29387] Extend3D: Town-Scale 3D Generation
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.29387 (cs)
[Submitted on 31 Mar 2026]

Title: Extend3D: Town-Scale 3D Generation
Authors: Seungwoo Yoon, Jinmo Kim, Jaesik Park

Abstract: In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. We then divide the extended latent space into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of the 3D structure as noise during 3D refinement enables 3D completion via a mechanism we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we...
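The overlapping-patch coupling described in the abstract can be illustrated with a minimal sketch: the extended latent is split into overlapping fixed-size patches along $x$ and $y$, each patch is denoised independently, and overlapping predictions are averaged so neighboring patches agree at every time step. The denoiser below is a placeholder stub, and all function names and sizes are hypothetical; this is not the paper's implementation.

```python
import numpy as np

def denoise_patch(patch, t):
    # Placeholder for one denoising step of the object-centric 3D
    # generative model on a fixed-size latent patch (hypothetical).
    return patch * (1.0 - 0.1 * t)

def coupled_denoise_step(latent, t, patch=8, stride=4):
    """One coupled denoising step over an extended 2D latent grid.

    Each overlapping patch is denoised separately; per-position
    predictions are accumulated and averaged by their coverage count,
    a common way to couple patch-wise diffusion outputs.
    """
    H, W = latent.shape[:2]
    acc = np.zeros_like(latent)
    # Coverage counter, broadcastable over trailing feature dims.
    cnt = np.zeros(latent.shape[:2] + (1,) * (latent.ndim - 2))
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            out = denoise_patch(latent[y:y+patch, x:x+patch], t)
            acc[y:y+patch, x:x+patch] += out
            cnt[y:y+patch, x:x+patch] += 1.0
    return acc / cnt

latent = np.ones((16, 24, 4))  # latent extended beyond one 8x8 patch
step = coupled_denoise_step(latent, t=1.0)
print(step.shape)  # (16, 24, 4)
```

With `stride < patch`, every latent position is covered by at least one patch and interior positions by several, so the averaging enforces agreement in the overlap regions rather than leaving visible seams between independently denoised patches.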