[2602.13818] VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer
Summary
The VAR-3D model introduces a novel approach to text-to-3D generation, addressing challenges in discrete 3D representation and enhancing geometric coherence through a view-aware auto-regressive framework.
Why It Matters
As the demand for realistic 3D models from textual descriptions grows, improving the fidelity and coherence of generated models is crucial. VAR-3D's advancements in integrating view-aware techniques and rendering-supervised training could significantly impact industries like gaming, virtual reality, and design.
Key Takeaways
- VAR-3D enhances text-to-3D generation by addressing encoding bottlenecks.
- The model integrates a view-aware 3D VQ-VAE for better geometric representation.
- A rendering-supervised training strategy improves visual fidelity and structural consistency.
- Experiments show VAR-3D outperforms existing methods in generation quality.
- The approach could revolutionize applications in gaming and virtual environments.
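The core of the 3D tokenizer is vector quantization: each continuous latent vector produced by the encoder is snapped to its nearest entry in a learned codebook, and the entry's index becomes a discrete token. The paper's exact codebook size and latent dimension are not given in this excerpt, so the sizes below are hypothetical; this is a minimal sketch of the nearest-neighbour lookup step only, not VAR-3D's full VQ-VAE.

```python
import numpy as np

# Hypothetical sizes: a codebook of 512 entries, each of dimension 64,
# and a batch of 100 latent feature vectors from the encoder.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # learned embedding table
latents = rng.normal(size=(100, 64))    # continuous encoder outputs

# Nearest-neighbour lookup: replace each latent with the index of its
# closest codebook vector under squared Euclidean distance.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)           # discrete tokens, shape (100,)
quantized = codebook[tokens]            # decoder input, shape (100, 64)
```

The information loss the abstract highlights happens exactly here: everything the encoder produced is collapsed onto the finite codebook, which is why distortion introduced before this step gets amplified by it.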
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13818 (cs)
[Submitted on 14 Feb 2026]
Title: VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer
Authors: Zongcheng Han, Dongyan Cao, Haoran Sun, Yu Hong
Abstract: Recent advances in auto-regressive transformers have achieved remarkable success in generative modeling. However, text-to-3D generation remains challenging, primarily due to bottlenecks in learning discrete 3D representations. Specifically, existing approaches often suffer from information loss during encoding, causing representational distortion before the quantization process. This effect is further amplified by vector quantization, ultimately degrading the geometric coherence of text-conditioned 3D shapes. Moreover, the conventional two-stage training paradigm induces an objective mismatch between reconstruction and text-conditioned auto-regressive generation. To address these issues, we propose View-aware Auto-Regressive 3D (VAR-3D), which integrates a view-aware 3D Vector Quantized-Variational AutoEncoder (VQ-VAE) to convert the complex geometric structure of 3D models into discrete tokens. Additionally, we introduce a rendering-supervised training strategy that couples discrete token prediction with visual reconstruction, enc...
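The abstract says the rendering-supervised strategy couples discrete token prediction with visual reconstruction, i.e. the two objectives that a conventional two-stage pipeline trains separately are optimized jointly. The paper's actual loss terms and weighting are not given in this excerpt; the sketch below only illustrates the general shape of such a coupled objective, with toy data and a hypothetical weight `lam`.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
logits = rng.normal(size=(100, 512))       # predicted token logits (toy)
targets = rng.integers(0, 512, size=100)   # ground-truth discrete tokens
rendered = rng.normal(size=(100, 3))       # rendered-view pixels (toy)
reference = rng.normal(size=(100, 3))      # reference-view pixels (toy)

# Auto-regressive term: cross-entropy on next-token prediction ...
ce = -np.log(softmax(logits)[np.arange(100), targets]).mean()
# ... coupled with a rendering term comparing rendered and reference views.
render_loss = ((rendered - reference) ** 2).mean()

lam = 0.1                                  # hypothetical weighting factor
total_loss = ce + lam * render_loss
```

Training both terms through one objective is what removes the mismatch the abstract describes: the token predictor is penalized not just for wrong indices but for indices that render into the wrong views.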