Apriel-H1: The Surprising Key to Distilling Efficient Reasoning Models
A blog post by ServiceNow-AI on Hugging Face
Published November 19, 2025

Authors: Torsten Scholak, Oleksiy Ostapenko, Raymond Li, Luke Kumar, Joel Lamy-Poirier (ServiceNow-AI)

We converted our 15B reasoning model into a Mamba hybrid that achieves 2.1x throughput with minimal quality loss. The key? A non-obvious insight about what data to distill on, and why intuition fails here.

When MiniMax published their M2 post-mortem in October explaining why they abandoned efficient attention at 230B scale, the narrative briefly became "efficient attention is dead." Within days, Kimi Linear proved otherwise. The real lesson: it depends on your constraints.

Our constraint was simple: we had a strong 15B reasoning model and needed to make it efficient without starting over. No infinite compute for 20T-token pretraining. No luxury of architectural co-design from day one. Just a practical question: can you retrofit efficiency into an existing model through distillation?

Spoiler: yes, but only if you ignore your intuition about what data to use.

What We Built

The Apriel-H1 family: seven checkpoints spanning 25-40 Mamba layers (out of 50 total), showing the complete efficiency-quality frontier. Our flagship, Apriel-H1-15b-Thinker-SFT, achieves 2.1x throughput with minimal quality loss: MATH500 an...
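To make the layer counts concrete, here is a minimal sketch of how one might describe such a hybrid: a 50-layer stack in which a chosen number of layers are swapped to Mamba while the rest keep full attention. The `hybrid_schedule` function and its even-interleaving placement are purely illustrative assumptions; the source does not say which positions Apriel-H1 converts.

```python
def hybrid_schedule(n_layers: int = 50, n_mamba: int = 40) -> list[str]:
    """Return a per-layer type list for a hypothetical attention/Mamba hybrid.

    This evenly spreads the retained attention layers across the stack as an
    illustration only; the actual Apriel-H1 placement strategy may differ.
    """
    n_attn = n_layers - n_mamba
    # Place the remaining attention layers at evenly spaced depths.
    attn_positions = {round((i + 0.5) * n_layers / n_attn) for i in range(n_attn)}
    return ["attention" if i in attn_positions else "mamba" for i in range(n_layers)]

# The flagship checkpoint converts 40 of 50 layers to Mamba.
schedule = hybrid_schedule(n_layers=50, n_mamba=40)
print(schedule.count("mamba"), schedule.count("attention"))  # 40 10
```

The seven checkpoints in the family would then correspond to calling this with `n_mamba` ranging from 25 to 40, tracing out the efficiency-quality frontier the post describes.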