[2602.17739] GeneZip: Region-Aware Compression for Long Context DNA Modeling
Summary
GeneZip is a DNA compression model for long-context DNA modeling. It uses region-aware compression to allocate representation budget across genomic regions, and achieves 137.6x compression with only a 0.31 increase in perplexity.
Why It Matters
Genomic sequences span billions of base pairs, so genome-scale models need compact representations to analyze them. GeneZip builds on a biological prior: coding regions make up only about 2 percent of the genome yet are information-dense, while most non-coding sequence is comparatively information-sparse. Spending representation budget accordingly could change how genomic data is processed and used in AI applications.
Key Takeaways
- GeneZip achieves 137.6x compression with only a 0.31 increase in perplexity.
- The model couples HNet-style dynamic routing with a region-aware compression-ratio objective to allocate representation budget adaptively across genomic regions.
- GeneZip allows for training larger models on a single GPU, enhancing scalability.
- It achieves comparable or better performance on contact map prediction, expression quantitative trait loci (eQTL) prediction, and enhancer-target gene prediction.
- The approach highlights the importance of biological priors in AI model design.
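To make the compression figure in the takeaways concrete, here is a minimal arithmetic sketch in Python. It is not GeneZip's implementation: the function names are hypothetical, and the per-region ratios in the second call are invented purely for illustration. Only the 137.6x overall ratio and the roughly 2 percent coding fraction come from the abstract.

```python
def effective_length(num_bp: int, compression_ratio: float) -> float:
    """Number of compressed positions left after compressing num_bp base pairs."""
    return num_bp / compression_ratio


def blended_ratio(coding_fraction: float,
                  coding_ratio: float,
                  noncoding_ratio: float) -> float:
    """Overall compression ratio when coding and non-coding regions
    are compressed at different ratios (hypothetical helper)."""
    # Compressed positions produced per input base pair, summed over both region types.
    positions_per_bp = (coding_fraction / coding_ratio
                        + (1.0 - coding_fraction) / noncoding_ratio)
    return 1.0 / positions_per_bp


# Reported overall ratio: a 1,000,000 bp window shrinks to roughly 7,267 positions.
print(round(effective_length(1_000_000, 137.6)))

# Illustrative only: 8x on coding and 256x on non-coding are assumed values,
# not numbers from the paper. With 2% coding sequence they blend to about 158x.
print(round(blended_ratio(0.02, 8.0, 256.0), 1))
```

The second function shows why a small, lightly compressed coding fraction can coexist with a very high overall ratio: the sparse 98 percent dominates the total count of compressed positions.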
Quantitative Biology > Genomics
arXiv:2602.17739 (q-bio)
[Submitted on 19 Feb 2026]
Title: GeneZip: Region-Aware Compression for Long Context DNA Modeling
Authors: Jianan Zhao, Xixian Liu, Zhihao Zhan, Xinyu Yuan, Hongyu Guo, Jian Tang
Abstract: Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, ...