[2602.14444] Broken Chains: The Cost of Incomplete Reasoning in LLMs
Summary
The paper explores the impact of incomplete reasoning in large language models (LLMs), revealing how different reasoning modalities affect performance under token constraints.
Why It Matters
Understanding the limitations of reasoning in LLMs is crucial for optimizing their deployment in resource-constrained environments. This research highlights the trade-offs between reasoning modalities, which can inform future model design and application in real-world scenarios.
Key Takeaways
- Incomplete reasoning can significantly degrade model performance.
- Code-based reasoning degrades more gracefully under token constraints than natural-language reasoning.
- Hybrid reasoning approaches tend to underperform compared to single modalities.
- Robustness varies significantly across models, influencing their effectiveness under limited token budgets.
- The findings have implications for deploying reasoning-specialized systems efficiently.
Computer Science > Machine Learning
arXiv:2602.14444 (cs) [Submitted on 16 Feb 2026]
Title: Broken Chains: The Cost of Incomplete Reasoning in LLMs
Authors: Ian Su, Gaurav Purushothaman, Jey Narayan, Ruhika Goel, Kevin Zhu, Sunishchal Dev, Yash More, Maheep Chaudhary
Abstract: Reasoning-specialized models like OpenAI's 5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10%, 30%, 50%, and 70% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) **truncated reasoning can hurt**: DeepSeek-V3.2 achieves 53% with no reasoning but only 17% with truncated CoT at 50% budget; (2) **code degrades gracefully**: Gemini's comments collapse to 0% while code maintains 43-47%; (3) **hybrid reasoning underperforms** single modalities; (4) **robustness is model-dependent**: Grok maintains 80-90% at 30% budget where OpenAI and Dee...
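The budget-ablation design described in the abstract (four reasoning modalities crossed with token budgets at 10%, 30%, 50%, and 70% of optimal) can be sketched as an evaluation grid. This is a minimal illustrative sketch only: the modality instructions, the `optimal_budget` value, and the `ablation_grid` helper are hypothetical assumptions, not the authors' actual harness.

```python
# Hypothetical sketch of the modality x token-budget ablation grid.
# Instruction strings and budget values are illustrative assumptions,
# not the paper's actual prompts or implementation.

MODALITIES = {
    "code": "Reason only through executable code; no prose.",
    "comments": "Reason only through natural-language comments.",
    "hybrid": "Reason through code with interleaved comments.",
    "none": "Answer directly with no intermediate reasoning.",
}

# Fractions of the optimal reasoning-token budget, as ablated in the paper.
BUDGET_FRACTIONS = [0.10, 0.30, 0.50, 0.70]

def ablation_grid(optimal_budget: int) -> list[dict]:
    """Enumerate every (modality, token-budget) condition to evaluate."""
    grid = []
    for modality, instruction in MODALITIES.items():
        for frac in BUDGET_FRACTIONS:
            grid.append({
                "modality": modality,
                "instruction": instruction,
                "max_reasoning_tokens": int(optimal_budget * frac),
            })
    return grid

if __name__ == "__main__":
    # With a hypothetical optimal budget of 2000 reasoning tokens,
    # this yields 4 modalities x 4 budgets = 16 evaluation conditions.
    for cond in ablation_grid(optimal_budget=2000):
        print(cond["modality"], cond["max_reasoning_tokens"])
```

Each condition would then be run per model and per benchmark; the "no reasoning" row serves as the baseline that, per finding (1), can outperform truncated chain-of-thought.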