[2602.15397] ActionCodec: What Makes for Good Action Tokenizers
Summary
The paper introduces ActionCodec, an action tokenizer designed around tokenization principles derived from the perspective of Vision-Language-Action (VLA) model optimization, improving both training efficiency and downstream performance.
Why It Matters
As VLA models become increasingly important in AI applications, understanding the principles of effective action tokenization is crucial. This research addresses a gap in the field, providing actionable insights that can lead to better model performance and efficiency, which is vital for advancements in robotics and AI.
Key Takeaways
- Action tokenization significantly impacts VLA model optimization.
- Best practices for action tokenizers include maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence.
- ActionCodec demonstrates improved training efficiency and stronger benchmark performance.
- The paper establishes design principles that can guide future developments in action tokenization.
- Achieving a state-of-the-art success rate without robotics pre-training showcases the model's effectiveness.
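The paper's formal definitions of these principles are not included in this summary, but the first one, temporal token overlap, can be illustrated with a toy sketch: if a tokenizer maps each action step to tokens independently (here, simple uniform binning), then two overlapping action chunks share their tokens at aligned positions, which is the property the principle asks to maximize. All names here (`bin_tokenize`, `temporal_token_overlap`) are hypothetical illustrations, not the paper's API.

```python
import numpy as np

def bin_tokenize(actions, n_bins=256, low=-1.0, high=1.0):
    """Toy per-step action tokenizer: uniformly bin each continuous
    action value into one of n_bins discrete token ids."""
    clipped = np.clip(actions, low, high)
    ids = ((clipped - low) / (high - low) * n_bins).astype(int)
    return np.minimum(ids, n_bins - 1)

def temporal_token_overlap(tokens_a, tokens_b, shift):
    """Fraction of matching tokens between two tokenized windows whose
    underlying action chunks overlap, offset by `shift` time steps."""
    a = tokens_a[shift:]
    b = tokens_b[: len(a)]
    return float(np.mean(a == b))

rng = np.random.default_rng(0)
traj = rng.uniform(-1, 1, size=40)        # 1-D action trajectory
win_a, win_b = traj[0:16], traj[4:20]     # overlapping chunks, shifted by 4
tok_a, tok_b = bin_tokenize(win_a), bin_tokenize(win_b)
print(temporal_token_overlap(tok_a, tok_b, shift=4))  # 1.0: per-step binning is fully time-local
```

A chunk-level compressive tokenizer (closer to what learned action codecs do) would generally score below 1.0 here, since tokens then depend on the whole window rather than individual steps; the principle favors designs that keep this overlap high.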
Computer Science > Robotics — arXiv:2602.15397 (cs)
[Submitted on 17 Feb 2026]
Title: ActionCodec: What Makes for Good Action Tokenizers
Authors: Zibin Dong, Yicheng Liu, Shiduo Zhang, Baijun Ye, Yifu Yuan, Fei Ni, Jingjing Gong, Xipeng Qiu, Hang Zhao, Yinchuan Li, Jianye Hao
Abstract: Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of *what makes for good action tokenizers* remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce **ActionCodec**, a high-performance action tokenizer that significantly enhances b...