[2601.22776] TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Computer Science > Artificial Intelligence

arXiv:2601.22776 (cs)
[Submitted on 30 Jan 2026 (v1), last revised 6 Apr 2026 (this version, v2)]

Title: TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Authors: Shichao Ma, Zhiyuan Ma, Ming Yang, Xiaofan Li, Xing Wu, Jintao Du, Yu Cheng, Weiqiang Wang, Qiliang Liu, Zhengyang Zhou, Yang Wang

Abstract: Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning rely predominantly on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the intermediate thinking, reasoning, and tool use within a trajectory are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards are often identical across sampled rollouts, weakening intra-group advantage estimation in methods such as Group Relative Policy Optimization (GRPO). To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which allocates partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals ...
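To make the two failure modes and the FOLR remedy concrete, here is a minimal Python sketch, not taken from the paper: the function names (`folr_rewards`, `grpo_advantages`), the substring check used to detect the "first occurrence" of the answer, and the `alpha` reward split are all illustrative assumptions. It shows that when every rollout in a group earns the same outcome reward, group-relative advantages collapse to zero, and how allocating a partial reward to the first step containing the ground-truth answer restores a turn-level signal.

```python
# Hypothetical sketch of FOLR-style reward allocation and group-relative
# advantage estimation. Names and the alpha split are assumptions for
# illustration, not the paper's actual implementation.
from typing import List

def folr_rewards(turns: List[str], ground_truth: str,
                 outcome_reward: float, alpha: float = 0.5) -> List[float]:
    """Give alpha * outcome_reward to the first turn containing the
    ground-truth answer; the final turn keeps the remainder."""
    rewards = [0.0] * len(turns)
    for i, turn in enumerate(turns):
        if ground_truth in turn:          # first occurrence of the answer
            rewards[i] += alpha * outcome_reward
            break
    rewards[-1] += (1.0 - alpha) * outcome_reward  # outcome-level reward
    return rewards

def grpo_advantages(group_rewards: List[float]) -> List[float]:
    """Group-relative advantage: normalize each rollout's reward by the
    group mean and standard deviation, as in GRPO."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    if std == 0.0:                        # identical rewards -> zero signal
        return [0.0] * n
    return [(r - mean) / std for r in group_rewards]

# Intra-group homogenization: a group where every rollout succeeds (or
# fails) yields identical outcome rewards and hence zero advantages.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))   # [0.0, 0.0, 0.0, 0.0]

# FOLR differentiates trajectories by *when* the answer first appears,
# producing distinct per-turn reward vectors for otherwise equal outcomes.
traj_a = ["search: X", "found: Paris", "answer: Paris"]
traj_b = ["search: X", "search: Y", "answer: Paris"]
print(folr_rewards(traj_a, "Paris", 1.0))      # [0.0, 0.5, 0.5]
print(folr_rewards(traj_b, "Paris", 1.0))      # [0.0, 0.0, 1.0]
```

Under these assumptions, the per-turn reward vectors differ between rollouts even when their final outcomes are identical, which is the kind of process-level signal the abstract says outcome-only rewards discard.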