[2605.01248] S^3-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
Computer Science > Machine Learning
arXiv:2605.01248 (cs)
[Submitted on 2 May 2026]

Title: S^3-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

Authors: Harsh Goel, Akhil Udathu, Susmija Jabireddy, Pradnesh Kalkar, Atharva Parulekar

Abstract: Reinforcement learning (RL) post-training has enabled newer capabilities in models, such as agentic tool-use for search. However, these models struggle primarily due to limitations with sparse outcome-based rewards and a lack of training data that encapsulates questions of differing hardness, which results in models not performing deeper searches with tools to collect evidence for question-answering. To address these limitations, we introduce S^3-R1 (Synthetic data and stabilized Search R1), a framework that couples a data-centric approach with denser learning signals. We first develop a synthetic generation and curation pipeline that programmatically derives diverse, multi-hop questions from existing documents. This pipeline incorporates a retrieval-based verification step to specifically isolate questions of intermediate difficulty. We then pair this expanded training set with a reward structure that evaluates both intermediate search quality and the correctness of the final answer. This setup directly mitigates the credit assignment problems inherent...
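
The two ideas in the abstract, a reward that scores both intermediate search quality and final-answer correctness, and a filter that keeps only questions of intermediate difficulty, can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Rollout`, `search_quality`, `composite_reward`, and `is_intermediate`, the evidence-recall measure, the exact-match check, the `search_weight` of 0.3, and the pass-rate bounds of 0.2 and 0.8 are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Rollout:
    """One hypothetical search-agent episode for a single question."""
    retrieved_ids: List[Set[str]]  # document ids returned at each search step
    answer: str                    # final answer string produced by the model


def search_quality(rollout: Rollout, gold_ids: Set[str]) -> float:
    """Assumed proxy for search quality: recall of gold evidence documents
    over the union of everything retrieved during the episode."""
    if not gold_ids:
        return 0.0
    found = set().union(*rollout.retrieved_ids)
    return len(found & gold_ids) / len(gold_ids)


def composite_reward(rollout: Rollout, gold_ids: Set[str], gold_answer: str,
                     search_weight: float = 0.3) -> float:
    """Blend a dense search term with a sparse outcome term.
    The weighting and the exact-match check are assumptions."""
    correct = float(rollout.answer.strip().lower() == gold_answer.strip().lower())
    return search_weight * search_quality(rollout, gold_ids) + (1.0 - search_weight) * correct


def is_intermediate(pass_rate: float, low: float = 0.2, high: float = 0.8) -> bool:
    """Keep a question only if a baseline retrieval-and-answer system solves it
    sometimes but not always. The thresholds are illustrative."""
    return low <= pass_rate <= high


if __name__ == "__main__":
    episode = Rollout(retrieved_ids=[{"d1"}, {"d2", "d9"}], answer="Paris ")
    print(composite_reward(episode, {"d1", "d2", "d3", "d4"}, "paris"))
    print(is_intermediate(0.5), is_intermediate(1.0))
```

Under these assumptions, an episode that finds only part of the evidence but answers correctly still earns more than one that answers incorrectly, while the search term gives a nonzero signal even when the final answer is wrong. That is one way a denser reward could ease credit assignment across search steps.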