[2601.22776] TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Computer Science > Artificial Intelligence

arXiv:2601.22776 (cs)
[Submitted on 30 Jan 2026 (v1), last revised 6 Apr 2026 (this version, v2)]

Title: TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
Authors: Shichao Ma, Zhiyuan Ma, Ming Yang, Xiaofan Li, Xing Wu, Jintao Du, Yu Cheng, Weiqiang Wang, Qiliang Liu, Zhengyang Zhou, Yang Wang

Abstract: Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning rely predominantly on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) process homogenization, where the intermediate thinking, reasoning, and tool use within a trajectory are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards are often identical across sampled rollouts, weakening intra-group advantage estimation in methods such as Group Relative Policy Optimization (GRPO). To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which allocates partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals ...
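To make the two failure modes and the FOLR remedy concrete, here is a minimal Python sketch, not taken from the paper: the function names (`folr_rewards`, `grpo_advantages`), the substring check used to detect the "first occurrence" of the answer, and the `alpha` reward split are all illustrative assumptions. It shows that when every rollout in a group earns the same outcome reward, group-relative advantages collapse to zero, and how allocating a partial reward to the first step containing the ground-truth answer restores a turn-level signal.

```python
# Hypothetical sketch of FOLR-style reward allocation and group-relative
# advantage estimation. Names and the alpha split are assumptions for
# illustration, not the paper's actual implementation.
from typing import List

def folr_rewards(turns: List[str], ground_truth: str,
                 outcome_reward: float, alpha: float = 0.5) -> List[float]:
    """Give alpha * outcome_reward to the first turn containing the
    ground-truth answer; the final turn keeps the remainder."""
    rewards = [0.0] * len(turns)
    for i, turn in enumerate(turns):
        if ground_truth in turn:          # first occurrence of the answer
            rewards[i] += alpha * outcome_reward
            break
    rewards[-1] += (1.0 - alpha) * outcome_reward  # outcome-level reward
    return rewards

def grpo_advantages(group_rewards: List[float]) -> List[float]:
    """Group-relative advantage: normalize each rollout's reward by the
    group mean and standard deviation, as in GRPO."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    if std == 0.0:                        # identical rewards -> zero signal
        return [0.0] * n
    return [(r - mean) / std for r in group_rewards]

# Intra-group homogenization: a group where every rollout succeeds (or
# fails) yields identical outcome rewards and hence zero advantages.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))   # [0.0, 0.0, 0.0, 0.0]

# FOLR differentiates trajectories by *when* the answer first appears,
# producing distinct per-turn reward vectors for otherwise equal outcomes.
traj_a = ["search: X", "found: Paris", "answer: Paris"]
traj_b = ["search: X", "search: Y", "answer: Paris"]
print(folr_rewards(traj_a, "Paris", 1.0))      # [0.0, 0.5, 0.5]
print(folr_rewards(traj_b, "Paris", 1.0))      # [0.0, 0.0, 1.0]
```

Under these assumptions, the per-turn reward vectors differ between rollouts even when their final outcomes are identical, which is the kind of process-level signal the abstract says outcome-only rewards discard.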