[2603.19835] FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

arXiv - Machine Learning March 23, 2026 4 min read

About this article

Abstract page for arXiv paper 2603.19835: FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Computer Science > Machine Learning

arXiv:2603.19835 (cs)

Submitted on 20 Mar 2026

Title: FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Authors: Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou

Abstract: We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at ...
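The abstract contrasts a uniform, outcome-based advantage with a dense one re-weighted by discounted future KL, but it does not give the formula. The following is only a rough sketch of that contrast, not the paper's method. The function names, the per-token KL input (assumed here to be an estimate of the divergence between the updated and the old policy at each position), the discount factor gamma, and the mean-normalized weighting rule are all assumptions made for illustration.

```python
from typing import List


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-normalized outcome advantage: one scalar per sampled trajectory.

    In GRPO-style training this scalar is applied to every token of the
    trajectory, which is the uniform credit assignment the abstract describes.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


def discounted_future_kl(per_token_kl: List[float], gamma: float) -> List[float]:
    """F_t = sum over k > t of gamma^(k - t - 1) * kl_k, computed right to left.

    ASSUMPTION: per_token_kl holds a per-position KL estimate; the source
    does not specify how the paper measures it.
    """
    future = [0.0] * len(per_token_kl)
    running = 0.0
    for t in range(len(per_token_kl) - 1, -1, -1):
        future[t] = running
        running = per_token_kl[t] + gamma * running
    return future


def dense_advantages(advantage: float, per_token_kl: List[float], gamma: float) -> List[float]:
    """Spread one trajectory-level advantage over tokens by future-KL weight.

    ASSUMPTION: weights are F_t divided by the mean of F, so they average
    to 1. This rule is hypothetical and chosen only to keep the sketch short.
    """
    future = discounted_future_kl(per_token_kl, gamma)
    mean_future = sum(future) / len(future)
    if mean_future <= 0.0:
        # No measurable future divergence: fall back to the uniform advantage.
        return [advantage] * len(future)
    return [advantage * f / mean_future for f in future]


if __name__ == "__main__":
    group_adv = grpo_advantages([1.0, 0.0])        # two sampled answers, one correct
    token_kl = [1.0, 2.0, 4.0]                     # hypothetical per-token KL values
    print(dense_advantages(group_adv[0], token_kl, gamma=0.5))
```

In this sketch, tokens followed by large downstream divergence receive a larger share of the trajectory's advantage, while the final token, which has no future, receives none. Whether FIPO treats the last token or the normalization this way is not stated in the abstract.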

Originally published on March 23, 2026. Curated by AI News.
