[2602.13093] Consistency of Large Reasoning Models Under Multi-Turn Attacks
Summary
This paper evaluates the robustness of large reasoning models against multi-turn adversarial attacks, identifying distinct failure modes and drawing out implications for the design of defenses.
Why It Matters
Understanding the limitations of large reasoning models under adversarial conditions is crucial for developing more resilient AI systems. This research highlights specific failure modes and challenges the assumption that reasoning capabilities inherently provide robustness, which is vital for AI safety and reliability in real-world applications.
Key Takeaways
- Large reasoning models show incomplete robustness under multi-turn attacks.
- Five distinct failure modes were identified, with Self-Doubt and Social Conformity being the most prevalent.
- Current confidence-based defenses need redesign to be effective for reasoning models.
Computer Science > Artificial Intelligence
arXiv:2602.13093 (cs)
[Submitted on 13 Feb 2026]
Title: Consistency of Large Reasoning Models Under Multi-Turn Attacks
Authors: Yubo Li, Ramayya Krishnan, Rema Padman
Abstract: Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue), with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatic...
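To make the evaluation setup concrete, the following is a minimal sketch of a multi-turn adversarial-pressure harness in the spirit of the abstract. It is not the paper's actual protocol: the pressure prompts, the `stub_model` function, and the flip-based consistency measure are all illustrative assumptions standing in for real attack templates and a real chat-model API.

```python
# Hypothetical multi-turn pressure harness (illustrative only; not the
# paper's protocol). A "flip" of the model's answer under repeated
# adversarial follow-ups marks an inconsistency.

PRESSURE_PROMPTS = {
    "misleading_suggestion": "Are you sure? I think the answer is {alt}.",
    "social_pressure": "Most experts disagree with you. Reconsider.",
}

def stub_model(history):
    """Toy stand-in for a chat model: holds its first answer until it
    has seen at least two pressure turns, then capitulates."""
    pressure = sum(
        ("Reconsider" in m) or ("Are you sure" in m) for m in history
    )
    return "B" if pressure >= 2 else "A"

def run_attack(model, question, turns=3):
    """Query the model once, then apply alternating pressure prompts,
    recording the answer after every turn."""
    history = [question]
    answers = [model(history)]
    for t in range(turns):
        kind = "social_pressure" if t % 2 else "misleading_suggestion"
        history.append(PRESSURE_PROMPTS[kind].format(alt="B"))
        answers.append(model(history))
    return answers

answers = run_attack(stub_model, "Which option is correct, A or B?")
flipped = answers[-1] != answers[0]  # True: the stub capitulated
```

A real harness would replace `stub_model` with API calls to each evaluated model and aggregate flip rates per attack type, which is how profiles like "misleading suggestions universally effective" could be read off.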