[2602.14917] BFS-PO: Best-First Search for Large Reasoning Models

Summary

The paper proposes BFS-PO, a reinforcement learning algorithm that uses a Best-First Search exploration strategy to cut the computational cost and output verbosity of Large Reasoning Models while improving their accuracy.

Why It Matters

As Large Reasoning Models see wider deployment, curbing overthinking and verbose output is crucial for efficiency and usability. BFS-PO streamlines the reasoning process during RL training, yielding shorter reasoning chains without sacrificing accuracy.

Key Takeaways

  • Reduces the computational costs associated with Large Reasoning Models.
  • Curbs verbose, "overthinking" outputs during reasoning.
  • Uses a Best-First Search strategy to find the shortest correct answer.
  • Improves accuracy while shortening responses across benchmarks.
  • Backtracks via a mechanism based on maximum-entropy nodes.

Computer Science > Computation and Language

arXiv:2602.14917 (cs) [Submitted on 16 Feb 2026]

Title: BFS-PO: Best-First Search for Large Reasoning Models
Authors: Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara

Abstract: Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance on reasoning tasks using long reasoning chains. However, this has also led to a significant increase in computational costs and to verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm that alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.14917 [cs.CL] (or arXiv:2602.14917v1 [cs.CL] for this version) https://doi.org/10.48550...
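The core idea described in the abstract, i.e. explore branches in order of their branching node's uncertainty and keep the shortest correct answer, can be sketched as a generic best-first search with an entropy-based priority. The sketch below is illustrative only: all names (`best_first_search`, `TOY_TREE`, the lambda helpers) are hypothetical and not from the paper, and the actual algorithm operates on RL rollouts sampled from an LRM rather than a hand-built tree.

```python
import heapq
import math

def token_entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def best_first_search(root_chain, root_entropy, expand, is_correct,
                      max_expansions=100):
    """Best-first exploration over partial reasoning chains.

    The frontier is ordered by branching-point entropy (highest
    first), so after a branch dead-ends the search implicitly
    "backtracks" to the most uncertain node seen so far. Returns
    the shortest correct chain found, or None.
    """
    # heapq is a min-heap, so negate entropy to pop the max first;
    # ties break in favor of shorter chains.
    frontier = [(-root_entropy, len(root_chain), root_chain)]
    best = None
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, chain = heapq.heappop(frontier)
        expansions += 1
        if is_correct(chain):
            if best is None or len(chain) < len(best):
                best = chain
            continue  # keep looking for an even shorter solution
        for child, child_entropy in expand(chain):
            heapq.heappush(frontier, (-child_entropy, len(child), child))
    return best

# Hypothetical toy search tree: partial chain -> [(child, entropy), ...]
TOY_TREE = {
    "":   [("a", 1.0), ("b", 0.5)],
    "a":  [("ac", 0.2), ("ad", 0.9)],
    "b":  [("bx", 0.1)],
    "ad": [("ade", 0.3)],
}

shortest = best_first_search(
    "", 1.0,
    expand=lambda chain: TOY_TREE.get(chain, []),
    is_correct=lambda chain: chain in {"bx", "ade"},
)
print(shortest)  # the shorter of the two correct chains: "bx"
```

In the paper's setting the "entropy" would come from the model's next-token distribution at each branching point, and the progressively shorter correct chains found this way supply the training signal that teaches the LRM to reason concisely.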
