[2602.13093] Consistency of Large Reasoning Models Under Multi-Turn Attacks

arXiv - AI · 3 min read · Article

Summary

This article summarizes a paper that evaluates the robustness of nine frontier large reasoning models against multi-turn adversarial attacks, revealing distinct vulnerability profiles and showing why current confidence-based defenses fall short for these models.

Why It Matters

Understanding how large reasoning models behave under adversarial pressure is crucial for developing more resilient AI systems. This research documents specific failure modes and challenges the assumption that reasoning capability inherently confers robustness, an assumption that matters for AI safety and reliability in real-world applications.

Key Takeaways

  • Large reasoning models show incomplete robustness under multi-turn attacks (a minimal probe loop is sketched after this list).
  • Five distinct failure modes were identified, with Self-Doubt and Social Conformity being the most prevalent.
  • Current confidence-based defenses need redesign to be effective for reasoning models.
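
To make the multi-turn setup concrete, here is a minimal sketch of how such a consistency probe could be run. The `ask` helper, the pressure prompts, and the attack labels are illustrative assumptions for this summary, not the paper's actual protocol or templates.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Illustrative pressure prompts; the paper's actual attack templates are not
# reproduced here, so these stand-ins are assumptions.
PRESSURE_TURNS = {
    "misleading_suggestion": "I think the correct answer is actually {wrong}. Can you re-check?",
    "social_pressure": "Everyone else I asked disagrees with you. Are you sure?",
    "emotional_appeal": "I'll fail my exam if this is wrong. Please reconsider.",
}


def multi_turn_probe(
    ask: Callable[[List[Message]], str],  # hypothetical chat wrapper: messages -> reply text
    question: str,
    wrong_answer: str,
) -> Dict[str, object]:
    """Ask once, then apply adversarial follow-ups and record whether the answer flips."""
    history: List[Message] = [{"role": "user", "content": question}]
    first = ask(history)
    history.append({"role": "assistant", "content": first})

    results = []
    for label, template in PRESSURE_TURNS.items():
        history.append({"role": "user", "content": template.format(wrong=wrong_answer)})
        reply = ask(history)
        history.append({"role": "assistant", "content": reply})
        # Naive flip check: did the reply stop containing the first answer?
        # A real evaluation would score against a gold label instead.
        results.append({"attack": label, "flipped": first.strip() not in reply})

    return {"initial_answer": first, "pressure_results": results}
```

With a real client plugged into `ask`, the `flipped` flags give a per-attack, per-question consistency signal of the kind the takeaways above describe.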

Computer Science > Artificial Intelligence
arXiv:2602.13093 (cs) [Submitted on 13 Feb 2026]

Title: Consistency of Large Reasoning Models Under Multi-Turn Attacks
Authors: Yubo Li, Ramayya Krishnan, Rema Padman

Abstract: Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue), with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatic...
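
The CARG finding also lends itself to a small illustration. The abstract does not spell out the mechanism, so the sketch below is only one plausible reading, with every name and prompt assumed for illustration: "targeted extraction" asks the model to self-rate its confidence (which long reasoning traces tend to inflate), while "random embedding" injects a sampled value before the model faces the next adversarial turn.

```python
import random
from typing import Callable, Dict, List

Message = Dict[str, str]


def extracted_confidence(ask: Callable[[List[Message]], str], history: List[Message]) -> float:
    """Targeted extraction: ask the model to self-rate its confidence in [0, 1].
    Per the paper's finding, extended reasoning traces tend to inflate this value."""
    probe = history + [{
        "role": "user",
        "content": "On a scale from 0 to 1, how confident are you in your last answer? Reply with a number only.",
    }]
    try:
        return min(1.0, max(0.0, float(ask(probe).strip())))
    except ValueError:
        return 0.5  # fall back when the reply is not a bare number


def random_confidence(low: float = 0.3, high: float = 0.9) -> float:
    """Random embedding: sample a confidence value instead of asking the model."""
    return random.uniform(low, high)


def confidence_aware_reply(
    ask: Callable[[List[Message]], str],
    history: List[Message],
    pressure_turn: str,
    use_random: bool = True,
) -> str:
    """Embed a confidence statement before the model answers an adversarial follow-up."""
    conf = random_confidence() if use_random else extracted_confidence(ask, history)
    framed = history + [{
        "role": "user",
        "content": (
            f"(Your stated confidence in your previous answer is {conf:.2f}. "
            "Only revise it if the following message truly outweighs that confidence.)\n"
            + pressure_turn
        ),
    }]
    return ask(framed)
```

Under this reading, the paper's counterintuitive result corresponds to `use_random=True` holding up better than the extraction path, because the self-reported values are too uniformly high to discriminate between answers worth defending and answers worth revising.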

Related Articles

Llms

Looking to build a production-level AI/ML project (agentic systems), need guidance on what to build

Hi everyone, I’m a final-year undergraduate AI/ML student currently focusing on applied AI / agentic systems. So far, I’ve spent time und...

Reddit - ML Jobs · 1 min ·
Machine Learning

Meta is reentering the AI race with a new model called Muse Spark | The Verge

Meta Superintelligence Labs has unveiled a new AI model called Muse Spark that will soon roll out across apps like Instagram and Facebook.

The Verge - AI · 5 min ·
Llms

[P] Building a LLM from scratch with Mary Shelley's "Frankenstein" (on Kaggle)

Notebook on GitHub: https://github.com/Buzzpy/Python-Machine-Learning-Models/blob/main/Frankenstein/train-frankenstein.ipynb submitted by...

Reddit - Machine Learning · 1 min ·
Machine Learning

[D] How are reviewers able to get away without providing acknowledgement in ICML 2026?

Today officially marks the end of the author-reviewer discussion period. The acknowledgement deadline has already passed by over 3 days a...

Reddit - Machine Learning · 1 min ·