[2602.14760] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers
Summary
This article examines a structural misalignment in Transformers: residual connections carry information about the current token, while next-token prediction supervises the model on the following token. The authors characterize this mismatch in large language models (LLMs) and propose mitigations that improve model performance.
Why It Matters
Understanding the misalignment between residual connections and token prediction is crucial for improving the efficiency and accuracy of LLMs. This research offers insights that could lead to better architectural designs in AI, enhancing the capabilities of autoregressive Transformers.
Key Takeaways
- Residual connections in Transformers tie hidden activations to the current token, while training supervises the next token, creating a structural misalignment.
- The study empirically identifies shifts in input-output alignment within pretrained LLMs.
- Proposed solutions include residual attenuation methods that improve model performance across benchmarks.
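To make the proposed mitigation concrete, here is a minimal numpy sketch of a residual connection with a learnable scalar gate that can attenuate the residual path. The function name, the scalar-gate parameterization, and the toy sublayer are our illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual_block(x, sublayer, gate_logit):
    """Residual connection with a learnable scalar gate.

    Standard residual:   x + sublayer(x)
    Attenuated residual: sigmoid(gate_logit) * x + sublayer(x)

    A gate below 1 weakens the current-token carry-over on the residual
    path, which is the direction of the paper's mitigation; the exact
    parameterization here is a hypothetical sketch.
    """
    alpha = sigmoid(gate_logit)
    return alpha * x + sublayer(x)

# Toy sublayer: a fixed linear map standing in for attention/MLP output.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1
sublayer = lambda x: x @ W

x = rng.standard_normal((2, 4))
# Large gate logit -> alpha near 1: behaves like a plain residual block.
full = gated_residual_block(x, sublayer, gate_logit=10.0)
# Negative gate logit -> alpha around 0.27: residual path is attenuated.
damped = gated_residual_block(x, sublayer, gate_logit=-1.0)
print(np.allclose(full, x + sublayer(x), atol=1e-3))
```

In a trained model, `gate_logit` would be a per-layer learnable parameter (the paper's learnable gating variant), or `alpha` a fixed constant applied only at chosen layers (the fixed-layer intervention).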
Computer Science > Computation and Language
arXiv:2602.14760 (cs)
[Submitted on 16 Feb 2026]

Title: Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers
Authors: Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene

Abstract: Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment a...
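The abstract's similarity-based probing over tied embedding spaces can be sketched as follows: score whether a layer's hidden state is closer to the tied embedding of the current (input) token or the next (target) token. The function names and the synthetic "shallow" and "deep" states are our illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_shift(hidden, embed, cur_id, next_id):
    """Output-alignment minus input-alignment for one hidden state.

    Positive: the state is more similar to the tied embedding of the
    next (target) token; negative: more similar to the current (input)
    token. A simplified stand-in for the paper's similarity metric.
    """
    return cosine(hidden, embed[next_id]) - cosine(hidden, embed[cur_id])

rng = np.random.default_rng(1)
E = rng.standard_normal((10, 8))   # tied embedding / unembedding matrix (toy sizes)
cur, nxt = 3, 7

# Synthetic states mimicking the reported trajectory: early layers stay
# close to the input embedding, late layers drift toward the target's.
shallow = E[cur] + 0.1 * rng.standard_normal(8)
deep = E[nxt] + 0.1 * rng.standard_normal(8)

print(alignment_shift(shallow, E, cur, nxt))  # negative: input-aligned
print(alignment_shift(deep, E, cur, nxt))     # positive: output-aligned
```

In a real model, `hidden` would be each layer's residual-stream state for a position, and the layer at which `alignment_shift` crosses zero localizes the input-to-output switch the abstract describes.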