[2604.00770] Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
Computer Science > Machine Learning
arXiv:2604.00770 (cs)
[Submitted on 1 Apr 2026]

Title: Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
Authors: Swapnil Parekh

Abstract: A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC >= 0.999). Yet a striking paradox emerges...
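A minimal sketch of the attack pattern the abstract describes, assuming a Coconut-style model that feeds its last hidden state back as the next input instead of decoding a token. The toy `LatentReasoner` class, the trigger vector `delta`, and all dimensions are hypothetical illustrations, not the paper's implementation; the point is only how a one-vector perturbation at the embedding layer compounds silently across latent reasoning passes:

```python
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    """Toy Coconut-style model (hypothetical): reasons for several passes
    in continuous latent space, feeding the last hidden state back as the
    next input embedding instead of emitting a token."""
    def __init__(self, vocab_size=1000, d_model=64, latent_steps=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.core = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)
        self.latent_steps = latent_steps

    def forward(self, input_ids, trigger_delta=None):
        x = self.embed(input_ids)                      # (B, T, d)
        if trigger_delta is not None:
            # ThoughtSteer-style injection: perturb a single input
            # embedding (here, position 0) with one learned vector.
            x = x.clone()
            x[:, 0, :] = x[:, 0, :] + trigger_delta
        out, h = self.core(x)
        thought = out[:, -1:, :]                       # last hidden state
        for _ in range(self.latent_steps):
            # Continuous latent reasoning: no tokens are produced, so the
            # perturbation is amplified across passes with no audit trail.
            thought, h = self.core(thought, h)
        return self.head(thought[:, -1, :])            # answer logits

model = LatentReasoner()
ids = torch.randint(0, 1000, (2, 8))
delta = 0.5 * torch.randn(64)      # in the attack this would be optimized
clean_logits = model(ids)
poisoned_logits = model(ids, trigger_delta=delta)
```

Because the trigger lives in embedding space rather than in the token sequence, nothing a token-level defense can inspect changes between the clean and poisoned runs.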
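The claimed detection signature (probe AUC >= 0.999) suggests a simple linear probe over latent states can separate triggered from clean inputs. A hedged sketch with synthetic stand-in data, since the paper's probing setup is not shown here; the Neural Collapse effect is simulated by placing triggered latents on a tight attractor:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 64
# Stand-in latents: clean states are scattered, triggered states are
# collapsed onto a tight geometric attractor, as the abstract describes.
clean = rng.normal(0.0, 1.0, size=(500, d))
attractor = rng.normal(0.0, 1.0, size=d)
triggered = attractor + rng.normal(0.0, 0.05, size=(500, d))

X = np.vstack([clean, triggered])
y = np.concatenate([np.zeros(500), np.ones(500)])

# A plain logistic-regression probe suffices when the two populations
# are linearly separable in latent space.
probe = LogisticRegression(max_iter=1000).fit(X, y)
auc = roc_auc_score(y, probe.predict_proba(X)[:, 1])
print(f"probe AUC: {auc:.4f}")  # approaches 1.0 as representations collapse
```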