[2505.07861] Scalable LLM Reasoning Acceleration with Low-rank Distillation
Summary
The paper presents Caprese, a low-rank distillation method that recovers the reasoning capabilities large language models (LLMs) lose under efficient inference methods, while preserving performance on language tasks.
Why It Matters
As LLMs become integral to more applications, accelerating inference without degrading reasoning ability is crucial. Caprese addresses the trade-off between computational cost and reasoning performance, making it relevant for researchers and practitioners deploying LLMs.
Key Takeaways
- Caprese recovers reasoning capabilities lost during efficient inference methods.
- The method adds roughly 1% more parameters and needs only 20K synthetic training samples.
- Caprese cuts active parameters (~2B fewer for Gemma 2 9B and Llama 3.1 8B) and reduces time-to-next-token latency by more than 16%.
- The approach maintains performance in language tasks while improving reasoning.
- It encourages brevity in responses, reducing token usage by up to 8.5%.
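The core idea, as described in the abstract, can be sketched as follows: the original feedforward weights stay frozen, and a small low-rank correction is added alongside them. The additive placement, dimensions, and rank below are illustrative assumptions, not the paper's exact architecture.

```python
def matvec(W, x):
    """Plain-Python matrix-vector product (rows of W dotted with x)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def ffn_with_lowrank(x, W, A, B):
    """Frozen dense layer W plus a rank-r correction B @ A.

    W: (d_out, d_in) frozen original weights (unperturbed)
    A: (r, d_in), B: (d_out, r) with r << min(d_in, d_out),
    so the correction adds only (d_in + d_out) * r trainable parameters.
    """
    base = matvec(W, x)            # original path, weights untouched
    low = matvec(B, matvec(A, x))  # low-rank correction path
    return [b + l for b, l in zip(base, low)]

# Tiny example: d_in = d_out = 4, rank r = 1, W = identity.
W = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
A = [[1, 1, 0, 0]]                 # (1, 4)
B = [[0.5], [0.0], [0.0], [0.0]]   # (4, 1)
x = [1.0, 2.0, 3.0, 4.0]
y = ffn_with_lowrank(x, W, A, B)   # [2.5, 2.0, 3.0, 4.0]
```

Only `A` and `B` would be trained during distillation; the base model's behavior is fully recoverable by zeroing them out.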
Computer Science > Computation and Language
arXiv:2505.07861 (cs)
Submitted on 8 May 2025 (v1), last revised 16 Feb 2026 (this version, v3)
Title: Scalable LLM Reasoning Acceleration with Low-rank Distillation
Authors: Harry Dong, Bilge Acun, Beidi Chen, Yuejie Chi
Abstract: Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a resource-efficient distillation method to recover lost capabilities from deploying efficient inference methods, focused primarily in feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the reasoning capabilities lost from efficient inference for thinking LLMs and without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>16% time-to-next-token reduction) while encouraging response brevity (up to 8.5% fewer tokens).
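A back-of-envelope check of the "roughly 1% of additional parameters" figure: for a dense (d_out, d_in) weight, a rank-r adapter adds (d_in + d_out) * r parameters. The feedforward shapes below are hypothetical Llama-3.1-8B-like values (hidden size 4096, intermediate size 14336), used only to illustrate the arithmetic; the paper's actual configuration may differ.

```python
def adapter_overhead(d_in, d_out, r):
    """Ratio of extra low-rank parameters to the dense weight they correct."""
    dense_params = d_in * d_out          # frozen original matrix
    extra_params = (d_in + d_out) * r    # A: (r, d_in) plus B: (d_out, r)
    return extra_params / dense_params

# Rank 32 against a 4096 x 14336 feedforward projection lands near 1%.
ratio = adapter_overhead(4096, 14336, 32)  # ~0.010
```

The overhead scales linearly in the rank, so the rank is the knob that trades recovery quality against the parameter budget.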