Apriel-H1: The Surprising Key to Distilling Efficient Reasoning Models
A blog post by ServiceNow-AI on Hugging Face
Published November 19, 2025

Authors: Torsten Scholak, Oleksiy Ostapenko, Raymond Li, Luke Kumar, Joel Lamy-Poirier (ServiceNow-AI)

We converted our 15B reasoning model into a Mamba hybrid that achieves 2.1x throughput with minimal quality loss. The key? A non-obvious insight about what data to distill on, and why intuition fails here.

When MiniMax published their M2 post-mortem in October explaining why they abandoned efficient attention at 230B scale, the narrative briefly became "efficient attention is dead." Within days, Kimi Linear proved otherwise. The real lesson: it depends on your constraints.

Our constraint was simple: we had a strong 15B reasoning model and needed to make it efficient without starting over. No infinite compute for 20T-token pretraining. No luxury of architectural co-design from day one. Just a practical question: can you retrofit efficiency into an existing model through distillation?

Spoiler: yes, but only if you ignore your intuition about what data to use.

What We Built

The Apriel-H1 family: seven checkpoints spanning 25-40 Mamba layers (out of 50 total), showing the complete efficiency-quality frontier. Our flagship, Apriel-H1-15b-Thinker-SFT, achieves 2.1x throughput with minimal quality loss: MATH500 an...
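To make the layer counts concrete, here is a minimal sketch of how one might describe such a hybrid: a 50-layer stack in which a chosen number of layers are swapped to Mamba while the rest keep full attention. The `hybrid_schedule` function and its even-interleaving placement are purely illustrative assumptions; the source does not say which positions Apriel-H1 converts.

```python
def hybrid_schedule(n_layers: int = 50, n_mamba: int = 40) -> list[str]:
    """Return a per-layer type list for a hypothetical attention/Mamba hybrid.

    This evenly spreads the retained attention layers across the stack as an
    illustration only; the actual Apriel-H1 placement strategy may differ.
    """
    n_attn = n_layers - n_mamba
    # Place the remaining attention layers at evenly spaced depths.
    attn_positions = {round((i + 0.5) * n_layers / n_attn) for i in range(n_attn)}
    return ["attention" if i in attn_positions else "mamba" for i in range(n_layers)]

# The flagship checkpoint converts 40 of 50 layers to Mamba.
schedule = hybrid_schedule(n_layers=50, n_mamba=40)
print(schedule.count("mamba"), schedule.count("attention"))  # 40 10
```

The seven checkpoints in the family would then correspond to calling this with `n_mamba` ranging from 25 to 40, tracing out the efficiency-quality frontier the post describes.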