I recently tested Gemma 3 27B locally and was blown away by the intelligence-to-size ratio of this model. These papers show how they achieved this level of distillation. [R]
The secret sauce is that the student model does not just try to guess the next token in a sentence, which is how most language models are trained. Instead, the teacher model shares its entire "thought process" for every single token: the full probability distribution over the vocabulary, not just the one "correct" answer. Feeding the student *more* information in order to build something *smaller* sounds counterintuitive, but this much "richer" signal at every step lets the student learn far more efficiently than it could on its own. Because o...
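For the curious, here is a minimal sketch of what that looks like as a training objective. This is not Gemma's exact recipe (the papers have the details); it's the standard soft-label distillation loss written in PyTorch, and the function name and temperature value are my own illustrative choices:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature so the teacher's
    # low-probability tokens still carry usable signal.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence pushes the student's distribution toward the
    # teacher's full distribution, not just the single argmax token.
    # The T^2 factor keeps gradient magnitudes on the same scale as
    # an ordinary hard-label cross-entropy loss.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

Compare this with ordinary pretraining, where the target is a one-hot "correct token" and everything the teacher knows about the runners-up gets thrown away.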