[2602.16839] Training Large Reasoning Models Efficiently via Progressive Thought Encoding
Summary
This paper presents Progressive Thought Encoding, a novel method for training large reasoning models (LRMs) that enhances efficiency and accuracy by reducing memory usage during reinforcement learning.
Why It Matters
As large reasoning models become increasingly integral to AI applications, optimizing their training processes is crucial. This research addresses significant challenges in memory management and efficiency, making it relevant for developers and researchers in machine learning and AI.
Key Takeaways
- Progressive Thought Encoding reduces memory usage during RL training without sacrificing performance.
- The method shows significant improvements in reasoning accuracy across multiple models and benchmarks.
- It allows for effective reasoning under fixed-size caches, addressing a critical barrier in LRM training.
arXiv:2602.16839 [cs]
Submitted on 18 Feb 2026
Title: Training Large Reasoning Models Efficiently via Progressive Thought Encoding
Authors: Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, Jianfeng Gao
Abstract: Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-t...
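To make the core idea of the abstract concrete, here is a minimal toy sketch of progressively encoding a stream of per-step hidden states into a fixed-size memory, so memory stays constant however long the rollout runs. This is not the paper's implementation: the slot count, the `W_update` matrix, and the attention-style blending rule are hypothetical placeholders standing in for the method's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_slots = 16, 4  # hidden size and fixed number of cache slots (assumed values)

# Placeholder for a learned update projection; random here, trained in practice.
W_update = rng.normal(scale=0.1, size=(d_model, d_model))

def encode_step(memory, hidden):
    """Fold one new hidden state into the fixed-size memory (toy rule)."""
    # Attention-like relevance of the new state to each memory slot.
    scores = memory @ hidden
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Blend a transformed copy of the new state into each slot, weighted by relevance.
    update = np.tanh(W_update @ hidden)
    return memory + 0.1 * weights[:, None] * update[None, :]

memory = np.zeros((n_slots, d_model))
for t in range(1000):  # arbitrarily long reasoning rollout
    hidden = rng.normal(size=d_model)
    memory = encode_step(memory, hidden)

# The encoded "thoughts" occupy a constant (n_slots, d_model) footprint,
# independent of rollout length.
print(memory.shape)
```

The point of the sketch is the shape invariant: the per-step cost of `encode_step` and the size of `memory` do not grow with the number of reasoning steps, which is what lets training and inference run under a fixed-size cache.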