[R] Depth-first pruning transfers: GPT-2 → TinyLlama with stable gains and minimal loss
TL;DR: Removing the right layers (instead of shrinking all layers) makes transformer models ~8–12% smaller with only ~6–8% quality loss, and this now works across architectures (GPT-2 + TinyLlama) with near-zero variance.

I've been experimenting with depth-first pruning: removing entire layers based on sensitivity rather than shrinking model width. Started on GPT-2… Just validated it on TinyLlama 1.1B with full 3-seed replication.

🧠 Results (TinyLlama 1.1B)

Depth-First Pruning (3 seeds)
Conf...
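The post doesn't include code, but the core loop (score each transformer block by how much quality drops when it is ablated, then delete the least-sensitive blocks) is easy to sketch. Below is a minimal, hypothetical PyTorch/transformers version against GPT-2; the sensitivity metric (calibration-loss increase), the calibration batch, and the layer count k are illustrative assumptions, not the author's exact method.

```python
# Minimal sketch of depth-first (layer) pruning, assuming a GPT-2-style model
# whose transformer blocks live in model.transformer.h (an nn.ModuleList).
# "Sensitivity" here is the rise in calibration loss when a block is ablated;
# the post does not specify its exact metric, so treat this as illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

# Tiny calibration batch; a real run would use held-out text.
batch = tok(["The quick brown fox jumps over the lazy dog."] * 4,
            return_tensors="pt").to(device)

@torch.no_grad()
def calib_loss(m):
    # GPT2LMHeadModel shifts labels internally for next-token loss.
    return m(**batch, labels=batch["input_ids"], use_cache=False).loss.item()

base = calib_loss(model)

# Score each block: remove it, measure the loss, put it back.
scores = []
for i in range(len(model.transformer.h)):
    block = model.transformer.h[i]
    del model.transformer.h[i]
    scores.append((calib_loss(model) - base, i))
    model.transformer.h.insert(i, block)

# Drop the k least-sensitive blocks (k=2 is an arbitrary choice here),
# deleting from the highest index down so earlier indices stay valid.
k = 2
drop = sorted(i for _, i in sorted(scores)[:k])
for i in reversed(drop):
    del model.transformer.h[i]
model.config.n_layer = len(model.transformer.h)

print(f"removed layers {drop}; calib loss {base:.3f} -> {calib_loss(model):.3f}")
```

At scale you'd score with perplexity on a real calibration set and likely fine-tune briefly after pruning; the point is just that pruning acts on whole blocks, so the surviving layers keep their full width.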