[2601.16210] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Summary
The paper introduces PyraTok, a language-aligned pyramidal tokenizer designed to enhance video understanding and generation by improving cross-modal alignment and zero-shot transfer capabilities.
Why It Matters
As video content continues to proliferate, effective video understanding and generation systems are crucial. PyraTok addresses limitations in existing tokenizers, offering improved performance in tasks like video segmentation and action localization, which can significantly advance AI applications in multimedia processing.
Key Takeaways
- PyraTok quantizes video encoder features at several depths, producing discrete tokens across multiple spatiotemporal resolutions.
- It enhances cross-modal alignment between visual tokens and language.
- Across ten benchmarks, the authors report state-of-the-art results in video reconstruction and understanding tasks.
- PyraTok scales to high-resolution video (4K/8K).
- To couple visual tokens with language, it jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy.
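
The multi-scale tokenization idea in the takeaways above can be illustrated with a small sketch. This is not the paper's LaPQ module: it assumes a plain sign-based binary quantizer (each feature channel becomes one bit), uses average pooling as a stand-in for features taken from different encoder depths, and omits text guidance and the autoregressive objective entirely. The function names `binary_quantize`, `avg_pool_time_space`, and `pyramid_tokenize` are hypothetical, as are the pooling factors.

```python
import numpy as np

def binary_quantize(features):
    """Map each d-dim vector to a code in {-1, +1}^d plus its integer token id.

    Assumption: sign-based quantization. The implicit codebook has 2**d
    entries and is shared by every scale that calls this function.
    """
    bits = features >= 0                                   # (..., d) booleans
    codes = np.where(bits, 1.0, -1.0)                      # quantized vectors
    weights = 2 ** np.arange(features.shape[-1], dtype=np.int64)
    indices = (bits.astype(np.int64) * weights).sum(axis=-1)
    return codes, indices

def avg_pool_time_space(x, factor):
    """Average-pool a (T, H, W, d) array by `factor` along T, H, and W."""
    t, h, w, d = x.shape
    # Drop any remainder so each axis divides evenly by `factor`.
    x = x[: t - t % factor, : h - h % factor, : w - w % factor]
    t, h, w = x.shape[:3]
    x = x.reshape(t // factor, factor, h // factor, factor, w // factor, factor, d)
    return x.mean(axis=(1, 3, 5))

def pyramid_tokenize(features, factors=(4, 2, 1)):
    """Return one grid of token ids per scale, ordered coarse to fine.

    Assumption: pooling the same feature volume stands in for reading
    features at several encoder depths.
    """
    return [binary_quantize(avg_pool_time_space(features, f))[1] for f in factors]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((4, 8, 8, 6))   # toy (T, H, W, d) features
    for ids in pyramid_tokenize(feats):
        print(ids.shape, "token ids in [0, 64)")
```

With six channels the shared vocabulary has 2^6 = 64 entries. The coarse levels contribute only a handful of tokens, so the full pyramid is only slightly longer than the finest level alone.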
Computer Science > Computer Vision and Pattern Recognition
arXiv:2601.16210 (cs)
[Submitted on 22 Jan 2026 (v1), last revised 23 Feb 2026 (this version, v2)]
Title: PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou
Abstract: Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video r...