[2603.01223] Learn Hard Problems During RL with Reference Guided Fine-tuning
Computer Science > Machine Learning
arXiv:2603.01223 (cs)
[Submitted on 1 Mar 2026]

Title: Learn Hard Problems During RL with Reference Guided Fine-tuning
Authors: Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai

Abstract: Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: on challenging problems, an LLM may fail to sample any correct trajectory, preventing RL from receiving meaningful positive feedback. At the same time, human-written reference solutions often accompany such problems (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that uses human-written reference solutions to synthesize positive trajectories on hard problems and trains on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, ensuring the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these ref...
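The abstract outlines a data-synthesis loop: reveal a prefix of the human-written reference solution, let the model complete the reasoning itself, keep completions that verify as correct, and fine-tune on them before RL. Below is a minimal sketch of that loop under stated assumptions; the helper names (generate, verify), the prefix-fraction schedule, and the choice to drop the hint from the training prompt are all illustrative and not specified by the abstract.

# Sketch of the reference-guided synthesis loop described in the abstract.
# generate, verify, and the prefix schedule are assumptions for illustration;
# the paper's actual procedure may differ.

def synthesize_regft_data(problems, generate, verify,
                          prefix_fractions=(0.25, 0.5, 0.75),
                          samples_per_prefix=4):
    """problems: iterable of dicts with 'statement', 'reference', 'answer'.
    generate(prompt) -> model completion string (caller-supplied).
    verify(trace, answer) -> bool, checks the final answer (caller-supplied).
    """
    sft_data = []
    for prob in problems:
        for frac in prefix_fractions:
            cut = int(len(prob["reference"]) * frac)
            hint = prob["reference"][:cut]  # partial reference solution as guidance
            prompt = (f"Problem: {prob['statement']}\n"
                      f"Partial solution: {hint}\n"
                      "Continue the reasoning and state the final answer.")
            for _ in range(samples_per_prefix):
                trace = generate(prompt)  # the model's own reasoning trace
                if verify(trace, prob["answer"]):  # keep only verified positives
                    # Train on problem -> trace; whether the hint stays in the
                    # training prompt is an open design choice not fixed by
                    # the abstract, so this sketch drops it.
                    sft_data.append({"prompt": prob["statement"],
                                     "completion": trace})
                    break  # one verified positive per prefix level suffices here
    return sft_data

Because the completion is sampled from the model itself rather than copied from the human proof, the resulting trajectories stay within the model's own reasoning distribution, which is the property the abstract identifies as missing from direct fine-tuning on reference solutions.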