[2602.22247] Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations

arXiv - Machine Learning · 4 min read

Summary

This article explores how single-cell foundation models like scGPT encode biological knowledge in their high-dimensional gene representations, finding that scGPT organizes genes into a structured biological coordinate system rather than an opaque feature space.

Why It Matters

Understanding the geometric structure of gene representations in models like scGPT matters for both genomics and machine learning: it offers insight into cellular organization and regulatory networks, and points to applications in drug discovery and model auditing.

Key Takeaways

  • scGPT organizes genes into a structured biological coordinate system.
  • The dominant spectral axis separates genes by subcellular localization (secreted proteins at one pole, cytosolic at the other), while orthogonal axes track protein-protein interaction strength.
  • Early transformer layers retain specific gene-regulatory information, while deeper layers generalize it into broader categories (a layer-wise probing sketch follows this list).
  • The findings enhance our understanding of cellular organization and regulatory networks.
  • Implications include improved drug target prioritization and model auditing.
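
The layer-wise takeaway can be pictured with a simple probing setup. The following is a minimal sketch, not the paper's method, assuming per-layer gene embeddings have already been extracted from the model; `layer_embeddings` and the `compartment` labels are hypothetical placeholders. One linear probe per layer estimates how decodable a categorical gene label (here a subcellular-compartment label, per the paper's localization findings) is at each depth.

```python
# Illustrative layer-wise probing sketch (not the paper's exact pipeline).
# Assumes per-layer gene embeddings have already been extracted from a
# single-cell transformer; `layer_embeddings` and `compartment` are
# random placeholders standing in for that data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_genes, dim, n_layers = 1000, 512, 12

# layer_embeddings[k] : (n_genes, dim) gene embeddings after layer k.
layer_embeddings = {k: rng.normal(size=(n_genes, dim)) for k in range(n_layers)}
# Compartment labels, e.g. 0 = cytosolic, 1 = secreted, 2 = mitochondrial, 3 = ER.
compartment = rng.integers(0, 4, size=n_genes)

# Fit one linear probe per layer and report cross-validated accuracy:
# layers where accuracy peaks are layers where the label is most decodable.
for k in range(n_layers):
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, layer_embeddings[k], compartment, cv=5).mean()
    print(f"layer {k:2d}: compartment probe accuracy = {acc:.3f}")
```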

Quantitative Biology > Genomics · arXiv:2602.22247 (q-bio) · Submitted on 24 Feb 2026

Title: Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
Authors: Ihor Kendiukhov

Abstract: Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space. The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017). In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (AUROC = 0.744, all 12 ...
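
As a concrete illustration of the kind of analysis the abstract describes, here is a minimal sketch, assuming gene embeddings have already been extracted from scGPT. It derives spectral axes from a gene-by-dimension matrix via PCA (a stand-in for whatever spectral decomposition the authors actually use), then checks (1) whether the mean projection per STRING confidence quintile varies monotonically with confidence (Spearman) and (2) how well a six-dimensional spectral subspace separates transcription factors from targets (AUROC). The `embeddings`, `string_quintile`, and `is_tf` arrays are random placeholders.

```python
# Minimal sketch of spectral-axis analysis of gene embeddings (illustrative
# only; inputs below are random placeholders, not real scGPT embeddings).

import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

n_genes, dim = 2000, 512
embeddings = rng.normal(size=(n_genes, dim))        # placeholder gene embeddings
string_quintile = rng.integers(0, 5, size=n_genes)  # placeholder STRING confidence bin (0-4)
is_tf = rng.integers(0, 2, size=n_genes)            # placeholder TF (1) vs. target (0) label

# "Spectral axes" here = principal components of the gene embedding matrix.
pca = PCA(n_components=6)
spectral = pca.fit_transform(embeddings)            # (n_genes, 6) spectral subspace

# (1) Monotonicity of the first spectral axis across STRING confidence quintiles.
quintile_means = [spectral[string_quintile == q, 0].mean() for q in range(5)]
rho, p = spearmanr(range(5), quintile_means)
print(f"Spearman rho across quintiles: {rho:.3f} (p = {p:.3f})")

# (2) TF-vs-target separation within the six-dimensional spectral subspace.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_predict(probe, spectral, is_tf, cv=5, method="predict_proba")[:, 1]
print(f"TF-vs-target AUROC: {roc_auc_score(is_tf, scores):.3f}")
```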

Related Articles

  • [R] Depth-first pruning transfers: GPT-2 → TinyLlama with stable gains and minimal loss
    TL;DR: Removing the right layers (instead of shrinking all layers) makes transformer models ~8–12% smaller with only ~6–8% quality loss, ...
    Reddit - Machine Learning · 1 min

  • Built a training stability monitor that detects instability before your loss curve shows anything — open sourced the core today
    Been working on a weight divergence trajectory curvature approach to detecting neural network training instability. Treats weight updates...
    Reddit - Artificial Intelligence · 1 min

  • This Is Not Hacking. This Is Structured Intelligence.
    Watch me demonstrate everything I've been talking about—live, in real time. The Setup: Maestro University AI enrollment system Standard c...
    Reddit - Artificial Intelligence · 1 min

  • [D] Howcome Muon is only being used for Transformers?
    Muon has quickly been adopted in LLM training, yet we don't see it being talked about in other contexts. Searches for Muon on ConvNets tu...
    Reddit - Machine Learning · 1 min