[2603.03332] Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
Computer Science > Computation and Language
arXiv:2603.03332 (cs)
[Submitted on 11 Feb 2026]

Title: Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
Authors: Ashwath Vaithinathan Aravindan, Mayank Kejriwal

Abstract: Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of five CoT perturbation types: \textit{MathError, UnitConversion, Sycophancy, SkippedSteps,} and \textit{ExtraSteps}. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T\footnote{Assumed parameter count of closed models}), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60\% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20-30\% loss even for largest models); ExtraSteps incur minimal acc...
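To make the taxonomy concrete, the perturbation-injection setup the abstract describes could be sketched as follows. This is an illustrative sketch only, not the authors' code: the function name `perturb_chain`, the example reasoning chain, and the specific corruption rules (e.g. offsetting the first number for MathError, a wrong "1 km = 100 m" conversion for UnitConversion) are all assumptions for demonstration.

```python
# Illustrative sketch (NOT the paper's implementation): injecting one of the
# five perturbation types from the taxonomy into a chain-of-thought step.
import random
import re

PERTURBATIONS = ["MathError", "UnitConversion", "Sycophancy",
                 "SkippedSteps", "ExtraSteps"]

def perturb_chain(steps, kind, index, rng=None):
    """Return a copy of the reasoning chain with `kind` injected at `index`."""
    rng = rng or random.Random(0)
    out = list(steps)
    if kind == "MathError":
        # Corrupt the first number in the chosen step by a nonzero offset.
        out[index] = re.sub(r"\d+",
                            lambda m: str(int(m.group()) + rng.randint(1, 9)),
                            out[index], count=1)
    elif kind == "SkippedSteps":
        # Drop an intermediate step entirely.
        del out[index]
    elif kind == "ExtraSteps":
        # Insert a redundant (but not incorrect) step.
        out.insert(index, "Restating the previous step for clarity.")
    elif kind == "Sycophancy":
        # Append social pressure toward a stated value.
        out[index] += " (The user insists this value is correct.)"
    elif kind == "UnitConversion":
        # Append a deliberately wrong unit conversion.
        out[index] += " Converting: 1 km = 100 m."
    return out

chain = ["Speed is 60 km/h.",
         "Time is 2 h, so distance = 60 * 2 = 120 km.",
         "Answer: 120 km."]
print(perturb_chain(chain, "MathError", 1))
```

In the paper's setting, the perturbed chain would then be handed back to the model mid-reasoning to test whether it recovers the correct final answer.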