[2505.07861] Scalable LLM Reasoning Acceleration with Low-rank Distillation

arXiv - Machine Learning · 3 min ·

Summary

The paper presents Caprese, a low-rank distillation method that recovers the reasoning capabilities large language models (LLMs) lose under efficient inference methods, while preserving efficiency and performance on language tasks.
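
To make the idea concrete, here is a minimal PyTorch sketch of a low-rank correction attached to a frozen feedforward block, in the spirit of the abstract's description (original weights unperturbed, a small number of trainable factors). The class names, shapes, and zero-initialized up-projection are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LowRankCorrection(nn.Module):
    """A rank-r correction branch (hypothetical sketch, not the paper's code)."""

    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # d_model -> r
        self.up = nn.Linear(rank, d_model, bias=False)    # r -> d_model
        nn.init.zeros_(self.up.weight)  # start as a no-op correction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class CorrectedFFN(nn.Module):
    """Frozen 'efficient' FFN plus the trainable low-rank branch."""

    def __init__(self, efficient_ffn: nn.Module, d_model: int, rank: int):
        super().__init__()
        self.ffn = efficient_ffn
        for p in self.ffn.parameters():
            p.requires_grad_(False)  # original weights stay unperturbed
        self.correction = LowRankCorrection(d_model, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ffn(x) + self.correction(x)
```

Because the up-projection starts at zero, the corrected block initially behaves exactly like the efficient block, and training only moves the small low-rank factors.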

Why It Matters

As LLMs become integral to more applications, making inference efficient without sacrificing reasoning capability is crucial. Caprese addresses this trade-off between computational cost and performance, making it relevant for researchers and practitioners in AI and machine learning.

Key Takeaways

  • Caprese recovers reasoning capabilities lost when efficient inference methods are deployed, leaving the original weights unperturbed.
  • The method requires only ~1% additional parameters and 20K synthetic training samples (a back-of-envelope check of this budget follows this list).
  • Caprese cuts roughly 2B active parameters for Gemma 2 9B and Llama 3.1 8B and reduces time-to-next-token latency by more than 16%.
  • Performance on language tasks is maintained while reasoning improves.
  • It encourages brevity in responses, reducing token usage by up to 8.5%.
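
The following arithmetic sketch checks what rank the "~1% additional parameters" budget would allow, assuming Llama 3.1 8B dimensions (hidden size 4096, 32 layers) and one low-rank (down, up) pair per feedforward block as in the sketch above; the rank is derived here, not reported in the paper.

```python
# Back-of-envelope check of the "~1% additional parameters" claim.
d_model, n_layers, total_params = 4096, 32, 8e9

budget = 0.01 * total_params       # ~80M extra parameters in total
per_layer = budget / n_layers      # ~2.5M extra parameters per FFN block
rank = per_layer / (2 * d_model)   # a (down, up) pair costs 2 * d_model * rank
print(f"max rank within budget ≈ {rank:.0f}")  # ≈ 305
```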

Computer Science > Computation and Language

arXiv:2505.07861 (cs) [Submitted on 8 May 2025 (v1), last revised 16 Feb 2026 (this version, v3)]

Title: Scalable LLM Reasoning Acceleration with Low-rank Distillation

Authors: Harry Dong, Bilge Acun, Beidi Chen, Yuejie Chi

Abstract: Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a resource-efficient distillation method to recover capabilities lost from deploying efficient inference methods, focused primarily on feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the reasoning capabilities lost from efficient inference for thinking LLMs, without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>16% time-to-next-token reduction) while encouraging response brevity (up to 8.5% fewer tokens).

Subjects: Computation and Language (cs.CL)
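
The abstract does not spell out the training objective, but a distillation setup consistent with its description (frozen original weights, ~20K synthetic samples, corrections localized to feedforward blocks) could look like the following sketch; the MSE loss and per-block granularity are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def distill_step(original_ffn, corrected_ffn, hidden_states, optimizer):
    """One hypothetical distillation step on a batch of hidden states."""
    # Teacher: the original dense feedforward block (frozen, no gradients).
    with torch.no_grad():
        target = original_ffn(hidden_states)
    # Student: the efficient block plus its trainable low-rank correction.
    pred = corrected_ffn(hidden_states)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()   # gradients reach only the low-rank factors
    optimizer.step()
    return loss.item()
```

Only the low-rank factors receive gradients, so the original checkpoint stays intact and the extra training cost remains small.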

Related Articles

LLMs

[R] Hybrid attention for small code models: 50x faster inference, but data scaling still dominates

TL;DR: Forked PyTorch and Triton internals. Changed attention so it's a linear first layer, a quadratic middle layer, and a linear last layer. Infer...

Reddit - Machine Learning · 1 min ·
LLMs

[R] Agentic AI and Occupational Displacement: A Multi-Regional Task Exposure Analysis (236 occupations, 5 US metros)

TL;DR: We extended the Acemoglu-Restrepo task displacement framework to handle agentic AI -- the kind of systems that complete entire wor...

Reddit - Machine Learning · 1 min ·
LLMs

Attention Is All You Need, But All You Can't Afford | Hybrid Attention

Repo: https://codeberg.org/JohannaJuntos/Sisyphus I've been building a small Rust-focused language model from scratch in PyTorch. Not a f...

Reddit - Artificial Intelligence · 1 min ·
LLMs

The “Agony” of ChatGPT: Would You Let AI Write Your Wedding Speech?

AI Tools & Products · 12 min ·
