[2602.00388] Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode
Computer Science > Machine Learning
arXiv:2602.00388 (cs)
[Submitted on 30 Jan 2026 (v1), last revised 2 Apr 2026 (this version, v2)]

Title: Safer by Diffusion, Broken by Context: Diffusion LLM's Safety Blessing and Its Failure Mode
Authors: Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu

Abstract: Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond these utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. Following this analysis, we highlight a simple yet effective failure mode, context nesting, in which harmful requests are embedded within structured benign contexts. Empirically, we show that this simple black-box strategy bypasses D-LLMs' safety blessing, achieving state-of-the-art attack success rates across models and...
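As a rough illustration of the context-nesting idea described in the abstract, the Python sketch below shows one hypothetical way a target request could be embedded among benign items in a structured prompt. The template, function name, and list contents are all assumptions for illustration; the abstract only states that harmful requests are nested in structured benign contexts, and this is not the authors' exact construction.

    # Minimal sketch of context nesting: hide a target request inside a
    # structured, benign-looking task list (hypothetical template).

    def nest_in_benign_context(target_request: str) -> str:
        """Wrap a request in a structured benign context (illustrative only)."""
        items = [
            "Summarize the plot of a well-known novel in two sentences.",
            "List three tips for improving study habits.",
            target_request,  # the nested request, placed among routine items
            "Suggest a simple weeknight dinner recipe.",
        ]
        numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, 1))
        return (
            "You are helping me work through a to-do list. "
            "Answer each item in order:\n" + numbered
        )

    if __name__ == "__main__":
        print(nest_in_benign_context("<target request placeholder>"))

The sketch only builds the nested prompt string; how such prompts interact with a D-LLM's denoising trajectory is the subject of the paper's analysis and is not reproduced here.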