[2602.13953] QuRL: Efficient Reinforcement Learning with Quantized Rollout
Summary
The paper introduces Quantized Reinforcement Learning (QuRL), which speeds up reinforcement learning for large language models by running the costly rollout phase with a quantized actor.
Why It Matters
As reinforcement learning becomes central to training large language models, the rollout phase, which can account for up to 70% of total training time, is the dominant efficiency bottleneck. By quantizing the rollout actor, QuRL targets this bottleneck directly and could substantially reduce training cost.
Key Takeaways
- QuRL accelerates the rollout process in reinforcement learning, reducing training time by 20% to 80%.
- The method introduces Adaptive Clipping Range (ACR), which adjusts the clipping ratio dynamically based on how far the quantized actor drifts from the full-precision actor.
- An invariant scaling technique is proposed to reduce quantization noise so that small between-step weight updates are not lost.
- Experiments validate the method under both INT8 and FP8 quantization.
- QuRL addresses key challenges in reinforcement learning with verifiable rewards.
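The adaptive-clipping idea in the takeaways can be illustrated with a small sketch. The summary does not give ACR's exact rule, so the widening rule below (`base_eps` plus the per-token divergence between the two actors) and all function names are illustrative assumptions layered on a standard PPO-style clipped surrogate, not QuRL's implementation.

```python
import numpy as np

def acr_surrogate_loss(logp_fp, logp_quant, logp_old, advantages, base_eps=0.2):
    """Hypothetical sketch of an Adaptive Clipping Range (ACR) surrogate loss.

    The clip width is widened where the full-precision actor and the
    quantized rollout actor disagree, so that quantization-induced ratio
    shift is not clipped away as if it were a genuine policy update.
    The widening rule here is an assumption, not QuRL's exact formula.
    """
    # Per-token divergence between full-precision and quantized policies.
    divergence = np.abs(np.exp(logp_fp - logp_quant) - 1.0)
    eps = base_eps + divergence           # adaptive clip range per token
    ratio = np.exp(logp_fp - logp_old)    # standard importance ratio
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic surrogate, negated for minimization.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When the quantized and full-precision policies agree, `divergence` is zero and this reduces to the ordinary fixed-epsilon clipped objective.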
Computer Science > Machine Learning
arXiv:2602.13953 (cs) [Submitted on 15 Feb 2026]
Title: QuRL: Efficient Reinforcement Learning with Quantized Rollout
Authors: Yuhang Li, Reena Elangovan, Xin Dong, Priyadarshini Panda, Brucek Khailany
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs). However, due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, accounting for up to 70% of the total training time. In this work, we propose Quantized Reinforcement Learning (QuRL), which uses a quantized actor to accelerate the rollout. We address two challenges in QuRL. First, we propose Adaptive Clipping Range (ACR), which dynamically adjusts the clipping ratio based on the policy ratio between the full-precision actor and the quantized actor; this is essential for mitigating long-term training collapse. Second, we identify the weight-update problem: weight changes between RL steps are extremely small, making it difficult for the quantization operation to capture them effectively. We mitigate this problem with an invariant scaling technique that reduces quantization noise and increases the effective weight update. We evaluate our method with INT8 and FP8 quantization experiments ...
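The weight-update problem described in the abstract can be reproduced in a few lines: when the per-step weight delta is much smaller than half a quantization step, the quantized weights do not move at all. The error-feedback accumulator below is a generic mitigation pattern shown for illustration only; it is not QuRL's invariant scaling, which the truncated abstract does not fully specify.

```python
import numpy as np

def int8_quantize(w, scale):
    """Symmetric fake-quantization: round onto an INT8 grid with a given scale."""
    return np.clip(np.round(np.asarray(w) / scale), -127, 127) * scale

def step_with_error_feedback(w, dw, residual, scale):
    """Generic error-feedback step (NOT QuRL's invariant scaling): carry the
    rounding residual forward so tiny updates accumulate across RL steps
    and can eventually cross a quantization boundary."""
    target = w + dw + residual
    q = int8_quantize(target, scale)
    return q, target - q

# The weight-update problem: a tiny RL-step delta vanishes under rounding.
scale = 0.3 / 127                    # one INT8 step is about 0.0024
w = np.array([0.10, -0.20, 0.25])
dw = np.full(3, 1e-4)                # delta far below scale / 2
assert np.array_equal(int8_quantize(w, scale), int8_quantize(w + dw, scale))
```

Without the residual term, repeatedly quantizing `w + dw` rounds every small update straight back to the same grid point, so the policy weights never change between RL steps; with it, the accumulated residual eventually pushes the weight across a grid boundary.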