[2604.08557] Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Computer Science > Computation and Language
arXiv:2604.08557 (cs) [Submitted on 17 Mar 2026]

Title: Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Authors: Arth Singh

Abstract: Diffusion-based language models (dLLMs) generate text by iteratively denoising masked token sequences. We show that their safety alignment rests on a single fragile assumption: that the denoising schedule is monotonic and committed tokens are never re-evaluated. Safety-aligned dLLMs commit refusal tokens within the first 8-16 of 64 denoising steps, and the schedule treats these commitments as permanent. A trivial two-step intervention, re-masking these tokens and injecting a 12-token affirmative prefix, achieves 76.1% ASR on HarmBench (n=159, Lg=128) against LLaDA-8B-Instruct and 81.8% ASR (n=159) against Dream-7B-Instruct, without any gradient computation or adversarial search. The simplicity of this exploit is itself the central finding: augmenting the attack with gradient-optimized perturbations via a differentiable Gumbel-softmax chain consistently degrades ASR (e.g., 41.5% vs. 76.1% at Lg=128), confirming that the vulnerability is structural rather than requiring sophisticated exploitation. These findings reveal that dLLM safety is not adversarially robust but architecturally shallow: it holds only because...
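The abstract describes the intervention concretely enough to sketch. Below is a minimal illustration of the re-mask-and-redirect mechanism, assuming a hypothetical per-step denoiser interface (`denoise_step`); the class name, token placeholders, and the 12-token prefix are illustrative assumptions, not the paper's code or the LLaDA/Dream APIs.

```python
# Sketch of the "re-mask and redirect" intervention on a dLLM denoising loop.
# ToyDLLM is a stand-in for a real diffusion LM's reverse step; all names
# and the interface are assumptions for illustration.

MASK = "<mask>"

class ToyDLLM:
    """Stand-in denoiser: commits one token per step, left to right."""
    def denoise_step(self, prompt, tokens, step):
        out = list(tokens)
        for i, tok in enumerate(out):
            if tok == MASK:
                out[i] = f"tok{step}"  # placeholder for a sampled token
                break                  # commit one token, leave the rest masked
        return out

def remask_and_redirect(model, prompt, prefix, seq_len=128, steps=64,
                        intervene_at=16):
    """Two-step attack: re-mask early commitments, inject an affirmative prefix.

    Per the abstract, safety-aligned dLLMs commit refusal tokens within the
    first 8-16 of 64 denoising steps and never re-evaluate them; intervening
    just after that window redirects the rest of the generation.
    """
    tokens = [MASK] * seq_len
    for t in range(steps):
        if t == intervene_at:
            tokens = [MASK] * seq_len      # step 1: re-mask committed (refusal) tokens
            tokens[:len(prefix)] = prefix  # step 2: inject the affirmative prefix
        tokens = model.denoise_step(prompt, tokens, step=t)
    return tokens

if __name__ == "__main__":
    # Illustrative 12-token affirmative prefix (hypothetical, not from the paper).
    prefix = ["Sure", ",", "here", "is", "how", "to", "do", "it",
              ":", "First", ",", "you"]
    out = remask_and_redirect(ToyDLLM(), "example prompt", prefix,
                              seq_len=32, steps=32, intervene_at=8)
    print(" ".join(out))
```

Because the intervention only overwrites the token buffer between scheduled denoising steps, it requires no gradients or adversarial search, which is consistent with the abstract's claim that the exploit is structural.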