[2509.21764] CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones
arXiv:2509.21764 (cs), Computer Vision and Pattern Recognition
Submitted on 26 Sep 2025 (v1), last revised 2 Mar 2026 (this version, v2)

Title: CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones
Authors: Wenyi Gong, Mieszko Lis

Abstract: Many modern ViT backbones adopt spatial architectural designs, such as window attention, decomposed relative positional embeddings in SAM, and RoPE in DINOv3. Such architectures impose new challenges on token reduction, as the vast majority of existing methods fail to preserve the spatial structure these architectures depend on. In this paper, we introduce a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures. We reconcile two seemingly conflicting requirements: (i) exploiting the uneven information distribution across the spatial layout while (ii) preserving the spatial structure post-merging. Our approach employs (i) a 2D reduction strategy to enforce structured token layouts, (ii) a spatial-aware merging algorithm that maintains relative token positions, and (iii) a novel max-magnitude-per-dimension token representation that preserves salient features. Our method demonstrates strong performance both off-the-shelf and with fine-tuning, achieving state-of-the-a...
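
The abstract names a max-magnitude-per-dimension token representation but does not spell out the rule. One plausible reading, sketched below as an assumption rather than the authors' implementation, is that a merged token keeps, for each feature dimension, the signed value with the largest absolute value among the tokens being merged. The function names `max_magnitude_merge` and `mean_merge` are hypothetical and are used only to contrast this reading with plain averaging.

```python
from typing import List, Sequence


def max_magnitude_merge(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Merge tokens by keeping, per dimension, the entry with the largest |value|.

    Assumed interpretation of "max-magnitude-per-dimension"; the sign of the
    selected entry is preserved. Ties resolve to the earliest token.
    """
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(tokens[0])
    if any(len(t) != dim for t in tokens):
        raise ValueError("all tokens must have the same dimensionality")
    # For each feature dimension, pick the signed value of largest magnitude.
    return [max((t[d] for t in tokens), key=abs) for d in range(dim)]


def mean_merge(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Baseline for comparison: element-wise average of the tokens."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


if __name__ == "__main__":
    a = [2.0, -0.5, 0.0]
    b = [-1.0, -3.0, 0.25]
    print(max_magnitude_merge([a, b]))  # [2.0, -3.0, 0.25]
    print(mean_merge([a, b]))           # [0.5, -1.75, 0.125]
```

In this toy case averaging shrinks the strong activations (2.0 and -3.0) toward zero, whereas the max-magnitude rule carries them through unchanged, which is consistent with the abstract's stated goal of preserving salient features.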