[2604.04037] Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
Computer Science > Machine Learning
arXiv:2604.04037 (cs) [Submitted on 5 Apr 2026]

Title: Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
Authors: Dawar Jyoti Deka, Nilesh Sarkar

Abstract: Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width $d_S$ can encode at most $d_S \cdot g(\alpha)$ features, where $g(\alpha) = 1/((1-\alpha)\ln\frac{1}{1-\alpha})$ is a sparsity-dependent capacity function. Features beyond this budget are permanently lost, yielding an importance-weighted loss floor. We validate this on a toy model (48 configurations, median accuracy >93%) and on Pythia-410M, where sparse autoencoders measure $F \approx 28{,}700$ features at $\alpha \approx 0.992$ (critical width $d_S^* \approx 1{,}065$). Distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component and a width-independent architectural baseline ($R^2 = 0.993$). Linear probing shows coarse concepts survive even 88% feature loss, revealing the fl...
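As a quick numerical sanity check (not the authors' code): if the critical width follows from the capacity budget $d_S \cdot g(\alpha) \ge F$, i.e. $d_S^* = F / g(\alpha)$, a minimal Python sketch can evaluate $g(\alpha)$ and the implied critical width for the Pythia-410M numbers reported in the abstract. The rearrangement $d_S^* = F / g(\alpha)$ is an assumption inferred from the stated feature budget.

```python
import math

def capacity_per_dim(alpha: float) -> float:
    """Sparsity-dependent capacity g(alpha) = 1 / ((1 - alpha) * ln(1 / (1 - alpha))).

    alpha is the feature sparsity; 1 - alpha is the activation density.
    """
    density = 1.0 - alpha
    return 1.0 / (density * math.log(1.0 / density))

def critical_width(num_features: float, alpha: float) -> float:
    """Smallest student width whose budget d_S * g(alpha) covers all F features.

    Assumes d_S* = F / g(alpha), rearranged from d_S * g(alpha) >= F.
    """
    return num_features / capacity_per_dim(alpha)

if __name__ == "__main__":
    F, alpha = 28_700, 0.992  # values reported for Pythia-410M
    g = capacity_per_dim(alpha)
    print(f"g({alpha}) ~ {g:.1f} features per dimension")
    print(f"critical width d_S* ~ {critical_width(F, alpha):.0f}")
    # gives ~26 features/dim and d_S* ~ 1.1k, in line with the reported ~1,065
```

With the rounded $\alpha \approx 0.992$ this yields roughly 26 features per dimension and $d_S^* \approx 1{,}100$; the paper's exact figure of 1,065 presumably reflects unrounded values of $F$ and $\alpha$.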