[2510.15301] Latent Diffusion Model without Variational Autoencoder
Computer Science > Computer Vision and Pattern Recognition

arXiv:2510.15301 (cs)

[Submitted on 17 Oct 2025 (v1), last revised 2 Mar 2026 (this version, v4)]

Title: Latent Diffusion Model without Variational Autoencoder

Authors: Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu

Abstract: Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity...
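
The feature-space construction the abstract describes lends itself to a short sketch. Below is a minimal, hypothetical PyTorch rendering of the idea, not the authors' implementation: a frozen self-supervised encoder (standing in for DINO) supplies semantically structured patch tokens, a small trainable residual branch contributes low-dimensional detail channels, and their concatenation forms the latent in which a diffusion model would be trained. All module names and dimensions are placeholders, and the assumption that the encoder emits stride-16 patch tokens is illustrative.

# Minimal sketch of an SVG-style latent encoder (illustrative, not the
# authors' code). Assumes `dino` maps (B, 3, H, W) images to patch tokens
# of shape (B, (H/16)*(W/16), dino_dim); all dimensions are placeholders.
import torch
import torch.nn as nn

class SVGStyleLatentEncoder(nn.Module):
    def __init__(self, dino: nn.Module, dino_dim: int = 768, detail_dim: int = 32):
        super().__init__()
        self.dino = dino.eval()
        for p in self.dino.parameters():  # frozen semantic encoder
            p.requires_grad = False
        # Lightweight residual branch: two strided convolutions with total
        # stride 16 so the detail grid matches the assumed token grid.
        self.residual = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.SiLU(),
            nn.Conv2d(64, detail_dim, kernel_size=4, stride=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            sem = self.dino(x)                      # (B, N, dino_dim), no gradients
        detail = self.residual(x)                   # (B, detail_dim, H/16, W/16)
        detail = detail.flatten(2).transpose(1, 2)  # (B, N, detail_dim)
        # Semantic tokens plus detail channels form the diffusion latent.
        return torch.cat([sem, detail], dim=-1)     # (B, N, dino_dim + detail_dim)

Keeping the semantic encoder frozen is the point of the design: the diffusion model trains in a space that already has discriminative structure, while only the small residual branch is learned to recover the fine-grained appearance a semantic encoder discards.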