[2602.10603] dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Summary
The paper presents dnaHNet, a tokenizer-free autoregressive foundation model for genomic sequence learning that reports higher predictive accuracy and a >3× inference speedup over existing architectures.
Why It Matters
As genomic data continues to grow, efficient models like dnaHNet are crucial for advancing bioinformatics. This model addresses key challenges in genomic sequence representation, potentially accelerating research in genetics and molecular biology.
Key Takeaways
- dnaHNet introduces a tokenizer-free approach for genomic sequences.
- The model employs a dynamic chunking mechanism for improved efficiency.
- It outperforms existing models in both speed and predictive accuracy.
- Recursive chunking yields quadratic FLOP reductions, enhancing scalability.
- Demonstrates superior performance in zero-shot genomic tasks.
Computer Science > Machine Learning
arXiv:2602.10603 (cs)
[Submitted on 11 Feb 2026 (v1), last revised 14 Feb 2026 (this version, v2)]
Title: dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Authors: Arnav Shah, Junzhe Li, Parsa Idehpour, Adibvafa Fallahpour, Brandon Wang, Sukjun Hwang, Bo Wang, Patrick D. Hsu, Hani Goodarzi, Albert Gu
Abstract: Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling $>3 \times$ inference speedup over Transformers. On zero-shot tasks, dnaHNet a...
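To make the chunking idea concrete, here is a minimal toy sketch of boundary-driven segmentation. This is an illustration inspired by the abstract's description, not the authors' implementation: the `dynamic_chunk` function, its `threshold` parameter, and the hand-picked boundary scores are all assumptions; in the actual model the boundaries would come from a learned, differentiable routing module rather than a hard threshold.

```python
def dynamic_chunk(seq, boundary_scores, threshold=0.5):
    """Split `seq` into variable-length chunks.

    A new chunk starts wherever the per-position boundary score
    exceeds `threshold`, so recurring motifs (e.g. codons) can be
    compressed into single latent tokens of varying length. The
    scores here are given explicitly for illustration; dnaHNet
    learns them end-to-end.
    """
    assert len(seq) == len(boundary_scores)
    chunks, current = [], []
    for base, score in zip(seq, boundary_scores):
        # Close the current chunk when a boundary fires (skip the
        # very first position, which always opens a chunk).
        if current and score > threshold:
            chunks.append("".join(current))
            current = []
        current.append(base)
    if current:
        chunks.append("".join(current))
    return chunks

# Example: high scores at positions 3 and 6 yield codon-like chunks.
seq = "ATGGCTAAA"
scores = [0.9, 0.1, 0.1, 0.8, 0.2, 0.1, 0.9, 0.0, 0.0]
print(dynamic_chunk(seq, scores))  # → ['ATG', 'GCT', 'AAA']
```

Note that chunk lengths adapt to the scores: with different boundaries the same sequence could compress into fewer, longer latent tokens, which is where the FLOP savings over fixed-resolution nucleotide models come from.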