[2602.14178] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
Summary
The paper presents UniWeTok, a unified binary tokenizer with a massive codebook size of 2^128, designed to enhance multimodal large language models (MLLMs) by improving visual representation and semantic extraction capabilities.
Why It Matters
As multimodal large language models become increasingly important in AI, UniWeTok addresses the challenges of integrating high-fidelity visual representations with generative capabilities. This advancement could significantly improve the performance of AI systems in tasks requiring both visual and textual understanding, making it relevant for researchers and developers in the field.
Key Takeaways
- UniWeTok utilizes a binary codebook of size 2^128 to improve visual representation in MLLMs.
- The proposed training framework enhances semantic extraction and generative capabilities.
- UniWeTok achieves state-of-the-art performance on ImageNet with significantly lower training compute requirements.
- Introduces a convolution-attention hybrid architecture with a novel SigLu activation function.
- Code and models are released for community exploration, promoting further research.
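A codebook of size 2^128 is far too large to store as an explicit embedding table; it is implicit in assigning each token 128 binary bits. The paper's exact quantizer is not detailed here, but a minimal sketch of how an implicit 2^128 codebook can arise from per-channel sign bits (in the spirit of lookup-free binary quantization; the function names below are hypothetical, not from the paper) might look like:

```python
import numpy as np

def binary_quantize(latent):
    """Quantize a 128-dim latent vector to 128 sign bits.

    With one independent bit per channel, the implicit codebook has
    2^128 entries, yet no embedding table is ever materialized.
    """
    return (latent > 0).astype(np.uint8)

def code_to_token(bits):
    """Pack 128 bits into a single integer token id in [0, 2^128)."""
    return int.from_bytes(np.packbits(bits).tobytes(), "big")

# Example: a random 128-dim encoder output maps to one of 2**128 codes.
rng = np.random.default_rng(0)
latent = rng.standard_normal(128)
bits = binary_quantize(latent)
token = code_to_token(bits)
assert 0 <= token < 2**128
```

Because the code is just the sign pattern of the encoder output, decoding a token back to a codebook vector is a fixed bit-unpacking step rather than a table lookup, which is what makes such a large effective vocabulary tractable.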
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.14178 (cs) [Submitted on 15 Feb 2026]
Title: UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model
Authors: Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang
Abstract: Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic dis...