[2509.24335] Hyperspherical Latents Improve Continuous-Token Autoregressive Generation
Computer Science > Computer Vision and Pattern Recognition
arXiv:2509.24335 (cs)
[Submitted on 29 Sep 2025 (v1), last revised 5 Mar 2026 (this version, v2)]

Title: Hyperspherical Latents Improve Continuous-Token Autoregressive Generation
Authors: Guolin Ke, Hui Xue

Abstract: Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs -- including after CFG -- to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that the hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is t...
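The core mechanism the abstract describes, constraining AR inputs and outputs (including the CFG-combined output) to a fixed-radius hypersphere, can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the NumPy setting, and the plain linear CFG combination are assumptions for illustration only. The key point it demonstrates is that re-projecting after guidance removes the scale component, so the guided latent keeps a constant $\ell_2$ norm regardless of the guidance scale.

```python
import numpy as np

def project_to_sphere(z, radius=1.0, eps=1e-8):
    """Project each latent token onto a fixed-radius hypersphere.

    Illustrative helper (not from the paper): l2-normalize along the
    feature axis, then rescale to the target radius.
    """
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return radius * z / np.maximum(norm, eps)

def cfg_on_sphere(z_cond, z_uncond, scale, radius=1.0):
    """Classifier-free guidance followed by re-projection.

    The standard CFG combination can push the latent off the sphere and
    inflate or shrink its norm; projecting back discards that scale
    component, which the abstract identifies as the driver of variance
    collapse during AR decoding.
    """
    z = z_uncond + scale * (z_cond - z_uncond)
    return project_to_sphere(z, radius)

# Even with a large guidance scale, the output norm stays fixed at the
# sphere radius, while the raw CFG combination does not.
rng = np.random.default_rng(0)
z_c = project_to_sphere(rng.normal(size=(3, 8)))
z_u = project_to_sphere(rng.normal(size=(3, 8)))
guided = cfg_on_sphere(z_c, z_u, scale=4.0)
print(np.linalg.norm(guided, axis=-1))  # all entries equal 1.0
```

Under this reading, the hyperspherical VAE guarantees the property for encoder outputs, and the re-projection step extends it to every AR decoding step, including after guidance.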