[2602.14444] Broken Chains: The Cost of Incomplete Reasoning in LLMs

arXiv - AI 4 min read Article

Summary

The paper explores the impact of incomplete reasoning in large language models (LLMs), revealing how different reasoning modalities affect performance under token constraints.

Why It Matters

Understanding the limitations of reasoning in LLMs is crucial for optimizing their deployment in resource-constrained environments. This research highlights the trade-offs between reasoning modalities, which can inform future model design and application in real-world scenarios.

Key Takeaways

  • Incomplete reasoning can significantly degrade model performance.
  • Code-based reasoning maintains better performance under constraints compared to natural language.
  • Hybrid reasoning approaches tend to underperform single-modality reasoning.
  • Robustness varies significantly across models, shaping how effective they are under limited token budgets.
  • The findings have implications for deploying reasoning-specialized systems efficiently.

Computer Science > Machine Learning
arXiv:2602.14444 (cs) [Submitted on 16 Feb 2026]

Title: Broken Chains: The Cost of Incomplete Reasoning in LLMs
Authors: Ian Su, Gaurav Purushothaman, Jey Narayan, Ruhika Goel, Kevin Zhu, Sunishchal Dev, Yash More, Maheep Chaudhary

Abstract: Reasoning-specialized models like OpenAI's GPT-5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10%, 30%, 50%, and 70% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) truncated reasoning can hurt, as DeepSeek-V3.2 achieves 53% with no reasoning but only 17% with truncated CoT at 50% budget; (2) code degrades gracefully, as Gemini's comments collapse to 0% while code maintains 43-47%; (3) hybrid reasoning underperforms single modalities; (4) robustness is model-dependent, as Grok maintains 80-90% at 30% budget where OpenAI and Dee...
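The paper's code is not described beyond the abstract, but the budget-ablation idea itself is simple: cut a model's reasoning trace down to a fixed fraction of its full token count before the model produces a final answer. The sketch below illustrates that mechanism under stated assumptions; `truncate_to_budget` and the whitespace tokenization are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of token-budget ablation (assumed names, not the paper's code).
# Whitespace splitting stands in for a real tokenizer.

def truncate_to_budget(trace_tokens, budget_fraction):
    """Keep only the leading `budget_fraction` of reasoning tokens."""
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in [0, 1]")
    keep = int(len(trace_tokens) * budget_fraction)
    return trace_tokens[:keep]

# A toy chain-of-thought trace; a real study would use model-generated traces.
full_trace = "Let x be the unknown . Step 1 ... Step 2 ... Therefore x = 7".split()

# The paper's ablation points: 10%, 30%, 50%, and 70% of the optimal budget.
for budget in (0.10, 0.30, 0.50, 0.70):
    truncated = truncate_to_budget(full_trace, budget)
    print(f"{int(budget * 100)}% budget -> kept {len(truncated)} of {len(full_trace)} tokens")
```

Because truncation keeps only the prefix of the trace, a chain whose conclusion arrives late loses its payoff entirely, which is consistent with the abstract's finding that a truncated CoT can score worse than no reasoning at all.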
