[2602.11767] TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

arXiv - Machine Learning · 4 min read

Summary

The paper introduces TSR (Trajectory-Search Rollouts), a novel approach to enhance multi-turn reinforcement learning for large language model agents, improving learning stability and performance through optimized trajectory generation.

Why It Matters

As large language models become integral to AI applications, optimizing their training through effective reinforcement learning techniques is crucial. TSR addresses challenges like sparse rewards and stochastic environments, offering a method that enhances agent performance while being adaptable to various optimization frameworks.

Key Takeaways

  • TSR improves multi-turn RL by optimizing trajectory generation.
  • The method improves performance by up to 15% and stabilizes learning.
  • TSR is optimizer-agnostic, making it versatile across different frameworks.
  • It utilizes tree-style search techniques to select high-scoring actions.
  • The approach shifts search from inference to training, simplifying agent learning.

Computer Science > Artificial Intelligence

arXiv:2602.11767 (cs) [Submitted on 12 Feb 2026 (v1), last revised 21 Feb 2026 (this version, v2)]

Title: TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Heiko Ludwig, Holger Boche

Abstract: Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging, as rewards are often sparse or delayed and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs a lightweight tree-style search to construct high-quality trajectories, selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, Froze...
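The best-of-N instantiation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: every function name and parameter here (`policy_sample`, `env_step`, `score`, `n_candidates`) is a hypothetical stand-in for the policy, environment, and task-specific feedback the paper assumes.

```python
def best_of_n_rollout(policy_sample, env_step, score, state,
                      n_candidates=4, max_turns=10):
    """Sketch of a TSR-style best-of-N rollout.

    At each turn, sample N candidate actions from the current policy and
    keep only the one with the highest task-specific score, so the
    trajectory handed to the optimizer (e.g. PPO or GRPO) is built from
    high-scoring actions. The RL objective itself is untouched.
    """
    trajectory = []
    for _ in range(max_turns):
        # Sample N candidate actions from the current policy.
        candidates = [policy_sample(state) for _ in range(n_candidates)]
        # Rank candidates with task-specific feedback and keep the best.
        best = max(candidates, key=lambda a: score(state, a))
        # Commit the winning action to the environment.
        state, reward, done = env_step(state, best)
        trajectory.append((best, reward))
        if done:
            break
    return trajectory
```

Because the search only changes which rollouts are collected, any policy-gradient optimizer can consume the resulting trajectories unchanged, which is what makes the approach optimizer-agnostic.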
