[2602.22596] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model
Summary
BetterScene introduces an approach to 3D scene synthesis that improves novel view synthesis quality from sparse, unconstrained photos using a representation-aligned generative model.
Why It Matters
This research addresses a key limitation of existing novel view synthesis methods: artifacts and inconsistent details when only a few input views are available. As 3D scene synthesis plays a crucial role in applications such as virtual reality and gaming, advances in this area can significantly improve user experiences and content creation.
Key Takeaways
- BetterScene enhances novel view synthesis (NVS) quality using sparse photos.
- It leverages a pretrained Stable Video Diffusion model to mitigate artifacts.
- Introduces temporal equivariance regularization and vision foundation model-aligned representation.
- Integrates 3D Gaussian Splatting for artifact-free and consistent novel views.
- Demonstrates superior performance on the DL3DV-10K dataset compared to existing methods.
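The summary does not spell out how the temporal equivariance regularization is implemented. A minimal sketch of one plausible form, assuming a per-frame latent encoder and a temporal shift as the transformation (the function names and the choice of transformation are hypothetical, not taken from the paper):

```python
import numpy as np

def shift_frames(latents: np.ndarray, k: int) -> np.ndarray:
    """Temporal transformation: cyclically shift frames along the time axis."""
    return np.roll(latents, k, axis=0)

def temporal_equivariance_loss(encode, frames: np.ndarray, k: int = 1) -> float:
    """Penalize the encoder for not commuting with the temporal shift:
    || encode(shift(frames)) - shift(encode(frames)) ||^2 (mean over entries)."""
    a = encode(shift_frames(frames, k))   # transform first, then encode
    b = shift_frames(encode(frames), k)   # encode first, then transform
    return float(np.mean((a - b) ** 2))
```

For a per-frame encoder that commutes with the shift (e.g., any frame-wise linear map), the loss is zero; a video backbone that mixes frames inconsistently across time would be penalized.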
arXiv:2602.22596 (cs.CV), submitted on 26 Feb 2026
Title: BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model
Authors: Yuci Han, Charles Toth, John E. Anderson, William J. Shuart, Alper Yilmaz
Abstract: We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the ...
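The abstract's second component aligns the diffusion model's internal representation with a vision foundation model, but gives no formula here. One common way to express such an alignment, sketched below purely as an assumption (the projection and loss form are illustrative, not confirmed by the paper), is a cosine-similarity objective between diffusion features and frozen foundation-model features:

```python
import numpy as np

def cosine_alignment_loss(diff_feats: np.ndarray, vfm_feats: np.ndarray) -> float:
    """1 minus the mean cosine similarity between (projected) diffusion
    features and frozen vision-foundation-model features of shape (N, D).
    Loss is 0 when the features point in the same direction per token."""
    a = diff_feats / np.linalg.norm(diff_feats, axis=-1, keepdims=True)
    b = vfm_feats / np.linalg.norm(vfm_feats, axis=-1, keepdims=True)
    return float(1.0 - np.mean(np.sum(a * b, axis=-1)))
```

In such schemes only the diffusion-side features receive gradients; the foundation-model features act as a fixed target, which is consistent with the abstract's use of a frozen pretrained prior.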