[2602.17684] CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
Summary
The paper presents CodeScaler, an execution-free reward model that scales both code LLM training and test-time inference, outperforming binary execution-based RL baselines across coding benchmarks.
Why It Matters
As code generation becomes increasingly vital in software development, improving the efficiency and effectiveness of model training is crucial. CodeScaler addresses scalability issues in reinforcement learning by eliminating the dependency on execution-based feedback, whose reliability is limited by the availability and quality of test cases. This advancement could lead to more robust AI systems that generate code with higher accuracy and lower latency.
Key Takeaways
- CodeScaler improves code LLM performance by an average of +11.72 points across benchmarks.
- It enables scalable reinforcement learning without the need for test cases.
- The model achieves a 10-fold reduction in latency compared to traditional unit test approaches.
- CodeScaler surpasses existing reward models in both code and general reasoning tasks.
- It utilizes syntax-aware code extraction for stable optimization.
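The test-time scaling role mentioned above can be sketched as best-of-N reranking: sample several candidate solutions, score each with the execution-free reward model (no unit tests run), and keep the highest-scoring one. The function names below are illustrative stand-ins, not the paper's published API.

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str], float],
              n: int = 8) -> str:
    """Best-of-N reranking with an execution-free reward model.

    `generate` stands in for a code LLM sampler and `score` for the
    reward model; both are assumptions for illustration. Because
    scoring needs no code execution, this avoids the sandboxing and
    latency costs of unit-test-based reranking.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

Because each candidate is scored independently, generation and scoring parallelize trivially, which is where the reported latency advantage over executing unit tests would come from.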
Computer Science > Machine Learning, arXiv:2602.17684 (cs). Submitted on 4 Feb 2026.
Authors: Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable...
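The syntax-aware code extraction and validity-preserving reward shaping described in the abstract can be sketched roughly as follows. The paper does not publish its exact procedure, so the extraction heuristic, the penalty value, and all function names here are assumptions: extract the fenced code from a model response, check it with Python's `ast` parser, and let the reward model score pass through only when the code is syntactically valid.

```python
import ast
import re

def extract_code(response: str) -> str:
    """Syntax-aware extraction (sketch): prefer the last fenced code
    block in the response; fall back to the raw text if none exists."""
    blocks = re.findall(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return blocks[-1] if blocks else response

def shaped_reward(response: str, rm_score: float) -> float:
    """Validity-preserving reward shaping (hypothetical form).

    Responses whose extracted code fails to parse receive a fixed
    floor penalty, so the learned reward only differentiates among
    syntactically valid programs, which keeps optimization stable.
    """
    try:
        ast.parse(extract_code(response))
    except SyntaxError:
        return -1.0  # invalid syntax: floor penalty (assumed value)
    return rm_score  # valid syntax: pass through reward-model score
```

Gating the reward on parseability prevents the policy from collecting spurious reward for malformed outputs, one plausible reading of "validity-preserving".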