[2511.11758] Protein Structure Tokenization via Geometric Byte Pair Encoding
Quantitative Biology > Quantitative Methods
arXiv:2511.11758 (q-bio)
[Submitted on 13 Nov 2025 (v1), last revised 1 Mar 2026 (this version, v2)]

Title: Protein Structure Tokenization via Geometric Byte Pair Encoding
Authors: Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Marinka Zitnik

Abstract: Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an $\mathrm{SE}(3)$ end-frame loss. GeoBPE offers compression ($>$10x...
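
Steps (i) and (ii) of the abstract, k-medoids clustering followed by quantization to the closest medoid, can be illustrated with a small sketch. This is not the authors' implementation. It assumes each Geo-Pair occurrence has already been summarized as a fixed-length descriptor vector and that Euclidean distance is a reasonable dissimilarity; the abstract specifies neither. The function names, the farthest-point initialization, and the toy data are all hypothetical, and step (iii), the inverse-kinematics drift correction, is not covered.

```python
import numpy as np

def pairwise_dist(x):
    # Euclidean distance matrix between all rows of x: (n, d) -> (n, n).
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def farthest_point_init(dist, k):
    # Deterministic init (an assumption, not from the paper): start at point 0,
    # then repeatedly add the point farthest from the medoids chosen so far.
    medoids = [0]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(nearest.argmax()))
    return np.array(medoids)

def k_medoids(x, k, n_iter=50):
    # Alternating k-medoids: assign each point to its nearest medoid, then
    # re-pick each cluster's medoid as the member with the smallest total
    # distance to the other members. Returns indices into x.
    dist = pairwise_dist(x)
    medoids = farthest_point_init(dist, k)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

def quantize(x, prototypes):
    # Map each descriptor to the index (token id) of its closest prototype.
    diff = x[:, None, :] - prototypes[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).argmin(axis=1)

# Toy descriptors (hypothetical): two well-separated groups of three points.
descriptors = np.array([
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.3, 5.0], [5.0, 5.1],
])
medoids = k_medoids(descriptors, k=2)
tokens = quantize(descriptors, descriptors[medoids])
print(medoids.tolist())  # [0, 3]
print(tokens.tolist())   # [0, 0, 0, 1, 1, 1]
```

In this sketch the number of clusters k plays the role of the resolution control mentioned in the abstract: a larger k gives a finer vocabulary of prototypes. Because medoids are actual data points, each token corresponds to an observed descriptor rather than an averaged one.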