[2602.22247] Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
Summary
This paper decodes the geometric structure of scGPT's internal gene representations through 63 iterations of automated hypothesis screening (183 hypotheses tested). It finds that the model organizes genes into a structured biological coordinate system rather than an opaque feature space.
Why It Matters
Understanding the geometric structure of gene representations in models like scGPT is crucial for advancing genomics and artificial intelligence. It provides insights into cellular organization, regulatory networks, and potential applications in drug discovery and model auditing.
Key Takeaways
- scGPT organizes genes into a structured biological coordinate system.
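- The dominant spectral axis separates genes by subcellular localization (secreted proteins at one pole, cytosolic proteins at the other), while orthogonal axes encode protein-protein interaction networks.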
- Early transformer layers maintain specific gene regulatory information, while deeper layers generalize this into broader categories.
- In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (AUROC = 0.744).
- Implications include improved drug target prioritization and model auditing.
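As a rough illustration of what a "spectral axis" can mean in the takeaways above, the sketch below centers a gene-by-dimension embedding matrix and uses a singular value decomposition to get each gene's coordinate on the leading axes. The data is synthetic, and the function name `spectral_axes`, the matrix shape, and the two gene groups are assumptions made for illustration. The paper's actual decomposition procedure is not described in this summary.

```python
import numpy as np


def spectral_axes(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Return per-gene coordinates on the top-k singular axes of the centered matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


# Synthetic stand-in for gene embeddings (hypothetical, not scGPT output):
# 40 genes in 16 dimensions, with two groups shifted along one hidden direction.
rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)
offsets = np.where(labels == 1, 3.0, -3.0)
embeddings = rng.normal(scale=0.1, size=(40, 16)) + np.outer(offsets, direction)

coords = spectral_axes(embeddings, k=6)  # six-dimensional spectral subspace
axis1 = coords[:, 0]                     # coordinate on the dominant axis
group0, group1 = axis1[labels == 0], axis1[labels == 1]
print(group0.mean(), group1.mean())      # the two groups fall at opposite poles
```

In this toy setup the dominant axis recovers the hidden direction, so the two groups sit at opposite ends of it. The sign of an SVD axis is arbitrary, so which group lands at the positive pole carries no meaning.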
Quantitative Biology > Genomics
arXiv:2602.22247 (q-bio)
[Submitted on 24 Feb 2026]
Title: Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
Authors: Ihor Kendiukhov
Abstract: Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space. The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017). In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (AUROC = 0.744, all 12 ...
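
A note on the reported Spearman statistic: with only n = 5 points, a perfectly monotone ranking is one of just two orderings (fully ascending or fully descending) out of 5! = 120 that reach |rho| = 1, so an exact two-sided permutation test gives p = 2/120 ≈ 0.017. The sketch below checks that arithmetic by enumeration. Which test the paper actually used is not stated in this excerpt, so this is only one calculation consistent with the reported value, and the function names are illustrative.

```python
from itertools import permutations
from math import factorial


def spearman_rho(x_ranks, y_ranks):
    """Spearman correlation for two tie-free rank vectors."""
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exact_two_sided_p(n, observed_rho):
    """Fraction of all n! rank orderings with |rho| at least as large as observed."""
    base = tuple(range(1, n + 1))
    threshold = abs(observed_rho) - 1e-12  # tolerance for float comparison
    hits = sum(
        1 for perm in permutations(base)
        if abs(spearman_rho(base, perm)) >= threshold
    )
    return hits / factorial(n)


p = exact_two_sided_p(5, 1.0)
print(round(p, 3))  # 0.017
```

The same enumeration shows why rho = 1.000 over five quintiles cannot yield a smaller two-sided p-value under this test: 2/120 is the floor for n = 5.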