[2602.22576] Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Summary
The paper presents Search-P1, a framework that stabilizes agentic Retrieval-Augmented Generation (RAG) training through path-centric reward shaping, improving multi-step reasoning accuracy in large language models (LLMs).
Why It Matters
This research addresses two limitations of existing RL-based agentic RAG training: sparse outcome rewards that discard intermediate signals, and low sample efficiency in which failed samples contribute nothing. By extracting learning signals from both successful and failed reasoning trajectories, the work could significantly impact the development of AI systems that require reliable multi-step reasoning, making it relevant for researchers and practitioners in AI and machine learning.
Key Takeaways
- Search-P1 introduces path-centric reward shaping to improve RAG training.
- The framework allows LLMs to learn from both successful and failed reasoning attempts.
- Experiments show an average accuracy improvement of 7.7 points over existing methods.
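The takeaways above hinge on a reward that gives failed trajectories partial credit for covering the right intermediate steps. A minimal Python sketch of that idea follows; the function name, the blend weight, and the set-based step matching are illustrative assumptions, not the paper's exact formulation.

```python
def path_coverage_reward(trajectory_steps, reference_steps, outcome_correct,
                         outcome_weight=0.5):
    """Blend a sparse outcome reward with soft, order-agnostic step coverage.

    trajectory_steps: normalized reasoning/retrieval steps the agent took
    reference_steps:  steps a reference plan deems necessary
    outcome_correct:  whether the final answer was correct

    All names and the 0.5 blend weight are hypothetical choices for this sketch.
    """
    if reference_steps:
        # Order-agnostic: intersect as sets, so step ordering does not matter.
        coverage = len(set(trajectory_steps) & set(reference_steps)) / len(set(reference_steps))
    else:
        coverage = 0.0
    outcome = 1.0 if outcome_correct else 0.0
    # A failed trajectory (outcome = 0) still earns reward proportional
    # to how many reference steps it covered, so it is not wasted.
    return outcome_weight * outcome + (1 - outcome_weight) * coverage
```

Under this sketch, a wrong answer that nonetheless retrieved two of three reference steps still receives a nonzero shaping signal, which is the sample-efficiency point the takeaways make.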
Computer Science > Computation and Language
arXiv:2602.22576 (cs)
[Submitted on 26 Feb 2026]
Title: Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Authors: Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonst...
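Taking the abstract's description at face value, Dual-Track Path Scoring might combine a self-consistency track (agreement of a path's answer with other sampled paths) and a reference-alignment track (overlap with an offline-generated reference plan). The sketch below is an assumption-laden illustration: the majority-agreement measure, the Jaccard overlap, and the `alpha` mixing weight are all hypothetical stand-ins for whatever the paper actually uses.

```python
from collections import Counter


def dual_track_score(path_answer, sampled_answers, path_steps, reference_steps,
                     alpha=0.5):
    """Score one reasoning path on two tracks (illustrative sketch only).

    path_answer:     the final answer this path produced
    sampled_answers: final answers from other sampled paths for the same query
    path_steps:      the steps this path took
    reference_steps: steps from an offline-generated reference planner
    """
    # Track 1: self-consistency, as the fraction of sampled paths
    # whose answer agrees with this path's answer.
    if sampled_answers:
        consistency = Counter(sampled_answers)[path_answer] / len(sampled_answers)
    else:
        consistency = 0.0
    # Track 2: reference-alignment, as Jaccard overlap between this
    # path's steps and the reference plan's steps.
    union = set(path_steps) | set(reference_steps)
    alignment = (len(set(path_steps) & set(reference_steps)) / len(union)
                 if union else 0.0)
    return alpha * consistency + (1 - alpha) * alignment
```

The design point the abstract suggests is that neither track alone suffices: self-consistency can reward confidently wrong majorities, while reference-alignment alone over-trusts the offline planner, so the two are blended.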