[2602.14917] BFS-PO: Best-First Search for Large Reasoning Models
Summary
The paper proposes BFS-PO, a new reinforcement learning algorithm that enhances the performance of Large Reasoning Models by reducing computational costs and output verbosity through a Best-First Search strategy.
Why It Matters
As Large Reasoning Models become increasingly prevalent in AI applications, addressing issues like overthinking and verbosity is crucial for improving efficiency and usability. BFS-PO offers a novel approach to streamline reasoning processes, potentially leading to more effective AI systems.
Key Takeaways
- BFS-PO reduces computational costs associated with Large Reasoning Models.
- The algorithm addresses the issue of verbose outputs in AI reasoning.
- Utilizes a Best-First Search strategy to find concise answers.
- Demonstrates improved accuracy and shorter responses in benchmarks.
- Offers a backtracking mechanism based on maximum entropy nodes.
Computer Science > Computation and Language
arXiv:2602.14917 (cs) [Submitted on 16 Feb 2026]
Title: BFS-PO: Best-First Search for Large Reasoning Models
Authors: Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara
Abstract: Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance on reasoning tasks that use long reasoning chains. However, this has also led to a significant increase in computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.14917 [cs.CL] (or arXiv:2602.14917v1 [cs.CL] for this version) https://doi.org/10.48550...
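The core idea of a best-first search that backtracks from maximum-entropy nodes and keeps the shortest correct answer can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `expand`, `is_correct`, and `entropy` callables are stand-ins for model rollouts, reward verification, and token-level uncertainty, all of which are assumptions here.

```python
import heapq

def best_first_search(root, expand, is_correct, entropy, max_steps=1000):
    """Toy Best-First Search over partial reasoning chains.

    Always resumes from the frontier node with maximum entropy (the most
    uncertain branching point) and returns the shortest correct answer
    found. All callables are hypothetical interfaces for illustration.
    """
    # heapq is a min-heap, so negate entropy to pop the max-entropy node.
    frontier = [(-entropy(root), len(root), root)]
    best = None
    steps = 0
    while frontier and steps < max_steps:
        steps += 1
        _, _, node = heapq.heappop(frontier)
        if is_correct(node):
            if best is None or len(node) < len(best):
                best = node  # keep only the shortest correct answer
            continue
        for child in expand(node):
            # Prune branches already no shorter than the current best.
            if best is not None and len(child) >= len(best):
                continue
            heapq.heappush(frontier, (-entropy(child), len(child), child))
    return best

# Toy usage: grow strings over {'a', 'b'}, looking for the shortest one
# containing "ab"; entropy here is a made-up score favoring short prefixes.
result = best_first_search(
    root="",
    expand=lambda s: [s + c for c in "ab"] if len(s) < 4 else [],
    is_correct=lambda s: "ab" in s,
    entropy=lambda s: 1.0 / (1 + len(s)),
)
```

In this toy run the search pops the highest-entropy (shortest) prefixes first and prunes any branch that cannot beat the best answer found so far, mirroring how the paper describes progressively shorter responses being preferred during training.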