[2602.14178] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
Summary
The paper presents UniWeTok, a unified binary tokenizer with a massive codebook size of 2^128, designed to enhance multimodal large language models (MLLMs) by improving visual representation and semantic extraction capabilities.
Why It Matters
As multimodal large language models become increasingly important in AI, UniWeTok addresses the challenges of integrating high-fidelity visual representations with generative capabilities. This advancement could significantly improve the performance of AI systems in tasks requiring both visual and textual understanding, making it relevant for researchers and developers in the field.
Key Takeaways
- UniWeTok utilizes a binary codebook of size 2^128 to improve visual representation in MLLMs.
- The proposed training framework enhances semantic extraction and generative capabilities.
- UniWeTok achieves state-of-the-art performance on ImageNet with significantly lower training compute requirements.
- Introduces a convolution-attention hybrid architecture with a novel SigLu activation function.
- Code and models are released for community exploration, promoting further research.
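A codebook of size 2^128 is far too large to store as an explicit embedding table; it is implicit in assigning each token 128 binary bits. The paper's exact quantizer is not detailed here, but a minimal sketch of how an implicit 2^128 codebook can arise from per-channel sign bits (in the spirit of lookup-free binary quantization; the function names below are hypothetical, not from the paper) might look like:

```python
import numpy as np

def binary_quantize(latent):
    """Quantize a 128-dim latent vector to 128 sign bits.

    With one independent bit per channel, the implicit codebook has
    2^128 entries, yet no embedding table is ever materialized.
    """
    return (latent > 0).astype(np.uint8)

def code_to_token(bits):
    """Pack 128 bits into a single integer token id in [0, 2^128)."""
    return int.from_bytes(np.packbits(bits).tobytes(), "big")

# Example: a random 128-dim encoder output maps to one of 2**128 codes.
rng = np.random.default_rng(0)
latent = rng.standard_normal(128)
bits = binary_quantize(latent)
token = code_to_token(bits)
assert 0 <= token < 2**128
```

Because the code is just the sign pattern of the encoder output, decoding a token back to a codebook vector is a fixed bit-unpacking step rather than a table lookup, which is what makes such a large effective vocabulary tractable.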
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.14178 (cs) [Submitted on 15 Feb 2026]
Title: UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
Authors: Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang
Abstract: Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic dis...