[2602.14178] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

arXiv - AI · 4 min read · Article

Summary

The paper presents UniWeTok, a unified binary tokenizer with a massive codebook size of 2^128, designed to enhance multimodal large language models (MLLMs) by improving visual representation and semantic extraction capabilities.

Why It Matters

As multimodal large language models become increasingly important in AI, UniWeTok addresses the challenges of integrating high-fidelity visual representations with generative capabilities. This advancement could significantly improve the performance of AI systems in tasks requiring both visual and textual understanding, making it relevant for researchers and developers in the field.

Key Takeaways

  • UniWeTok uses a binary codebook of size 2^128 to improve visual representation in MLLMs (see the sketch after this list).
  • The proposed training framework enhances semantic extraction and generative capabilities.
  • It achieves state-of-the-art performance on ImageNet while requiring significantly less training compute.
  • It introduces a convolution-attention hybrid architecture with a novel SigLu activation function.
  • Code and models are released for community exploration, promoting further research.
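A codebook of size 2^128 is far too large to store as an explicit embedding table, which suggests an implicit codebook: each visual token is a 128-bit binary code, and the 2^128 entries arise combinatorially from 128 binary channels. The excerpt does not describe UniWeTok's quantizer in detail, so the PyTorch sketch below is only a hypothetical illustration of binary quantization with a straight-through gradient estimator; the class and method names are placeholders, not the paper's API.

import torch
import torch.nn as nn

class BinaryQuantizer(nn.Module):
    """Hypothetical sketch: quantize a 128-dim latent into a binary code.

    Each of the 128 channels is mapped to {-1, +1}, so the implicit
    codebook has 2^128 entries without any stored embedding table.
    """

    def __init__(self, code_dim: int = 128):
        super().__init__()
        self.code_dim = code_dim

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, num_tokens, code_dim) continuous encoder output
        z_hard = torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
        # Straight-through estimator: the forward pass uses the binary code,
        # the backward pass treats quantization as the identity.
        return z + (z_hard - z).detach()

    @staticmethod
    def to_bits(codes: torch.Tensor) -> torch.Tensor:
        # A 128-bit code exceeds the int64 range, so codes are kept as
        # 0/1 bit vectors rather than packed into a single token id.
        return (codes > 0).long()

A practical consequence is that a 128-bit code cannot be packed into a standard 64-bit token id; how UniWeTok exposes these codes to the language model is not specified in this excerpt.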

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.14178 (cs) · Submitted on 15 Feb 2026

Title: UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

Authors: Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang

Abstract: Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic dis...
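The abstract names a convolution-attention hybrid encoder with a new SigLu activation, but the excerpt defines neither precisely. The block below is a generic sketch of how convolution and self-attention are commonly interleaved in image-tokenizer encoders, with a standard GELU standing in for the unspecified SigLu; it should not be read as UniWeTok's actual architecture.

import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Generic conv + attention block (illustrative only, not UniWeTok's design).

    The convolutional path captures local texture; self-attention over the
    flattened spatial tokens captures global context.
    """

    def __init__(self, channels: int = 256, num_heads: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.GroupNorm(32, channels),  # channels must be divisible by 32
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),  # placeholder; the paper proposes a SigLu activation
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        x = x + self.conv(x)                       # local path: residual conv block
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (batch, h*w, channels)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + attn_out                 # global path: residual self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)

Interleaving residual convolution blocks with global self-attention is a common pattern in recent tokenizer encoders; whether UniWeTok follows this exact layout, and how SigLu differs from a GELU, would need the full paper.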

Related Articles

Llms

What I learned about multi-agent coordination running 9 specialized Claude agents

I've been experimenting with multi-agent AI systems and ended up building something more ambitious than I originally planned: a fully ope...

Reddit - Artificial Intelligence · 1 min ·
Llms

[D] The problem with comparing AI memory system benchmarks — different evaluation methods make scores meaningless

I've been reviewing how various AI memory systems evaluate their performance and noticed a fundamental issue with cross-system comparison...

Reddit - Machine Learning · 1 min ·
Llms

Shifting to AI model customization is an architectural imperative | MIT Technology Review

In the early days of large language models (LLMs), we grew accustomed to massive 10x jumps in reasoning and coding capability with every ...

MIT Technology Review · 6 min ·
Llms

Artificial intelligence will always depend on humans, otherwise it will become obsolete.

I was looking for a tool for my specific need. There wasn't one, so I started to write the program in Python, just a basic structure. Then...

Reddit - Artificial Intelligence · 1 min ·