[2602.23153] Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Summary
This article presents Fase3D, an encoder-free, Fourier-based large multimodal model (LMM) for 3D scenes. It replaces the heavy pre-trained visual encoder with a lightweight tokenizer to improve efficiency and scalability.
Why It Matters
3D LMMs typically depend on heavy, pre-trained visual encoders to extract geometric features, and removing them is hard because point clouds are unordered and large. Fase3D addresses this with a tokenizer that combines point cloud serialization and the Fast Fourier Transform, which could make 3D scene understanding cheaper to run and easier to scale.
Key Takeaways
- Fase3D eliminates the need for heavy pre-trained visual encoders in 3D models.
- The model uses a novel tokenizer that combines point cloud serialization with the Fast Fourier Transform (FFT) to approximate self-attention efficiently.
- Fase3D achieves comparable performance to traditional models while significantly reducing computational requirements.
- The architecture incorporates structured superpoints for compact scene representation.
- Global frequency-aware interactions are integrated at minimal computational cost.
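The serialization and FFT ideas in the takeaways above can be illustrated with a small sketch. It is not the Fase3D implementation: the choice of a Z-order (Morton) curve, the use of raw coordinates as token features, keeping only the real part after the FFT, and every function name below are assumptions made for illustration. The source only states that a space-filling curve serialization is combined with the FFT to approximate self-attention.

```python
import cmath


def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative integer grid coordinates."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key


def serialize(points, bits=10):
    """Return point indices ordered along a Z-order curve (assumed curve type)."""
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    scale = (1 << bits) - 1

    def cell(p):
        # Quantize each coordinate onto an integer grid of 2**bits cells.
        return tuple(
            int((p[d] - lo[d]) / (hi[d] - lo[d]) * scale) if hi[d] > lo[d] else 0
            for d in range(3)
        )

    return sorted(
        range(len(points)),
        key=lambda i: morton_key(*cell(points[i]), bits=bits),
    )


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fourier_mix(tokens):
    """Mix tokens globally: FFT along the sequence axis per channel, real part kept."""
    channels = list(zip(*tokens))
    mixed = [[c.real for c in fft(list(ch))] for ch in channels]
    return [list(row) for row in zip(*mixed)]


# Hypothetical toy scene: four points, so the sequence length is a power of two.
points = [(0.9, 0.1, 0.5), (0.1, 0.1, 0.1), (0.5, 0.9, 0.2), (0.2, 0.8, 0.9)]
order = serialize(points)
tokens = [list(points[i]) for i in order]  # raw coordinates as stand-in features
mixed = fourier_mix(tokens)
```

Sorting by curve key gives the same token sequence whatever order the points arrive in, provided the keys are distinct, which is one way to handle unordered input. The FFT then lets every token influence every other in O(n log n) time rather than the O(n^2) of full self-attention.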
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.23153 (cs)
[Submitted on 26 Feb 2026]
Title: Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Authors: Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi
Abstract: Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an...