[2602.14917] BFS-PO: Best-First Search for Large Reasoning Models
Summary
The paper proposes BFS-PO, a new reinforcement learning algorithm that enhances the performance of Large Reasoning Models by reducing computational costs and output verbosity through a Best-First Search strategy.
Why It Matters
As Large Reasoning Models become increasingly prevalent in AI applications, addressing issues like overthinking and verbosity is crucial for improving efficiency and usability. BFS-PO offers a novel approach to streamline reasoning processes, potentially leading to more effective AI systems.
Key Takeaways
- BFS-PO reduces computational costs associated with Large Reasoning Models.
- The algorithm addresses the issue of verbose outputs in AI reasoning.
- Utilizes a Best-First Search strategy to find concise answers.
- Demonstrates improved accuracy and shorter responses in benchmarks.
- Offers a backtracking mechanism based on maximum entropy nodes.
Computer Science > Computation and Language
arXiv:2602.14917 (cs) [Submitted on 16 Feb 2026]
Title: BFS-PO: Best-First Search for Large Reasoning Models
Authors: Fiorenzo Parascandolo, Wenhui Tan, Enver Sangineto, Ruihua Song, Rita Cucchiara
Abstract: Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown excellent performance on reasoning tasks that use long reasoning chains. However, this has also led to a significant increase in computational costs and the generation of verbose output, a phenomenon known as overthinking. The tendency to overthink is often exacerbated by Reinforcement Learning (RL) algorithms such as GRPO/DAPO. In this paper, we propose BFS-PO, an RL algorithm which alleviates this problem using a Best-First Search exploration strategy. Specifically, BFS-PO looks for the shortest correct answer using a backtracking mechanism based on maximum-entropy nodes. By generating progressively shorter responses during training, BFS-PO learns to produce concise reasoning chains. Using different benchmarks and base LRMs, we show that BFS-PO can simultaneously increase LRM accuracy and shorten its answers.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.14917 [cs.CL] (or arXiv:2602.14917v1 [cs.CL] for this version) https://doi.org/10.48550...
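The core idea of a best-first search that backtracks from maximum-entropy nodes and keeps the shortest correct answer can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `expand`, `is_correct`, and `entropy` callables are stand-ins for model rollouts, reward verification, and token-level uncertainty, all of which are assumptions here.

```python
import heapq

def best_first_search(root, expand, is_correct, entropy, max_steps=1000):
    """Toy Best-First Search over partial reasoning chains.

    Always resumes from the frontier node with maximum entropy (the most
    uncertain branching point) and returns the shortest correct answer
    found. All callables are hypothetical interfaces for illustration.
    """
    # heapq is a min-heap, so negate entropy to pop the max-entropy node.
    frontier = [(-entropy(root), len(root), root)]
    best = None
    steps = 0
    while frontier and steps < max_steps:
        steps += 1
        _, _, node = heapq.heappop(frontier)
        if is_correct(node):
            if best is None or len(node) < len(best):
                best = node  # keep only the shortest correct answer
            continue
        for child in expand(node):
            # Prune branches already no shorter than the current best.
            if best is not None and len(child) >= len(best):
                continue
            heapq.heappush(frontier, (-entropy(child), len(child), child))
    return best

# Toy usage: grow strings over {'a', 'b'}, looking for the shortest one
# containing "ab"; entropy here is a made-up score favoring short prefixes.
result = best_first_search(
    root="",
    expand=lambda s: [s + c for c in "ab"] if len(s) < 4 else [],
    is_correct=lambda s: "ab" in s,
    entropy=lambda s: 1.0 / (1 + len(s)),
)
```

In this toy run the search pops the highest-entropy (shortest) prefixes first and prunes any branch that cannot beat the best answer found so far, mirroring how the paper describes progressively shorter responses being preferred during training.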