[2510.24694] Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Summary
The paper presents E-GRPO (Entity-aware Group Relative Policy Optimization), a framework that repurposes the entities embedded in synthetic training data to supervise search agents, enabling them to learn from near-miss samples and improving accuracy on complex, knowledge-intensive tasks.
Why It Matters
This research addresses a critical limitation of current training methods for search agents, which discard the learning signal in near-miss samples: rollouts whose reasoning is substantially correct but whose final answer is wrong. By assigning these samples partial credit, E-GRPO improves both the accuracy and the efficiency of trained search agents.
Key Takeaways
- E-GRPO improves learning by utilizing near-miss samples.
- The framework assigns partial rewards based on entity match rates.
- Empirical results show significant performance improvement over GRPO.
- E-GRPO leads to more efficient reasoning policies with fewer tool calls.
- The study highlights the importance of entity-centric training in AI.
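The partial-reward idea in the takeaways above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the substring-based entity matcher, the function names, and the `alpha` scaling hyperparameter are all assumptions made for the example.

```python
def entity_match_rate(reasoning_trace: str, gold_entities: list[str]) -> float:
    """Fraction of ground-truth entities that appear in the agent's reasoning.
    Uses naive case-insensitive substring matching as a hypothetical stand-in
    for whatever matcher the paper actually employs."""
    if not gold_entities:
        return 0.0
    trace = reasoning_trace.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in trace)
    return hits / len(gold_entities)

def entity_aware_reward(answer_correct: bool, reasoning_trace: str,
                        gold_entities: list[str], alpha: float = 0.5) -> float:
    """Dense reward: full credit for a correct final answer; otherwise partial
    credit proportional to the entity match rate, scaled by an assumed
    hyperparameter alpha, so near-misses outrank complete failures."""
    if answer_correct:
        return 1.0
    return alpha * entity_match_rate(reasoning_trace, gold_entities)
```

Under a sparse outcome-only reward, a rollout that surfaces every ground-truth entity but fumbles the final answer scores the same as a random failure; the sketch above instead separates the two, which is the intuition behind E-GRPO's advantage over vanilla GRPO.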
arXiv Listing
Computer Science > Computation and Language, arXiv:2510.24694 (cs)
[Submitted on 28 Oct 2025 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Authors: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
Abstract: LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware...