[2510.03272] Where to Add PDE Diffusion in Transformers
Summary
This paper investigates the optimal placement of PDE diffusion layers in transformer architectures, revealing that their insertion order relative to attention mechanisms significantly impacts model performance.
Why It Matters
Understanding where to integrate PDE diffusion in transformers is crucial for enhancing model accuracy and efficiency. This research provides a theoretical framework that can guide future developments in hybrid architectures, potentially leading to more effective machine learning models.
Key Takeaways
- Diffusion and attention in transformers do not commute, so model performance depends on their insertion order.
- Early diffusion improves accuracy by 4.1 percentage points when placed after embedding.
- Post-attention diffusion can degrade performance by 2.5 percentage points.
- A multi-scale diffusion variant shows consistent performance gains.
- The study offers a framework for analyzing local-global compositions in sequence models.
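The first takeaway, that diffusion and attention do not commute, can be checked numerically. The sketch below is illustrative only: the shapes, random weights, and plain single-head softmax attention are assumptions for the demo, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                       # sequence length, feature dim (illustrative)
X = rng.standard_normal((n, d))   # token representations
Wq = rng.standard_normal((d, d))  # query projection (random, for the demo)
Wk = rng.standard_normal((d, d))  # key projection

def diffuse(X, tau=0.25):
    """One explicit Euler step of 1D heat smoothing along the sequence
    axis, with edge-padding acting as a Neumann (zero-flux) boundary;
    tau < 0.5 keeps the step stable."""
    Xp = np.pad(X, ((1, 1), (0, 0)), mode="edge")
    lap = Xp[:-2] - 2 * X + Xp[2:]
    return X + tau * lap

def attend(X):
    """Plain softmax self-attention with values = X (no output projection)."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ X

pre = attend(diffuse(X))    # diffusion inserted before attention
post = diffuse(attend(X))   # diffusion inserted after attention
gap = np.linalg.norm(pre - post)
print(gap)                  # nonzero: the two orderings produce different outputs
```

Because attention re-weights tokens based on the (smoothed or unsmoothed) representations themselves, the two compositions generically disagree, which is the structural point the paper formalizes.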
Computer Science > Machine Learning
arXiv:2510.03272 (cs)
[Submitted on 27 Sep 2025 (v1), last revised 16 Feb 2026 (this version, v3)]
Title: Where to Add PDE Diffusion in Transformers
Authors: Yukun Zhang, Xueqing Zhou
Abstract: Transformers enable powerful content-based global routing via self-attention, but they lack an explicit local geometric prior along the sequence axis. As a result, the placement of locality-inducing modules in hybrid architectures has largely been empirical. We study a simple deterministic PDE diffusion layer implemented as one explicit Euler step of one-dimensional heat smoothing using a discrete Neumann Laplacian under a spectral stability constraint, and ask a structural question: where should diffusion be inserted relative to attention? Our central claim is that diffusion and attention generally do not commute, so inserting the same local operator before versus after attention leads to qualitatively different behaviors. We develop a three-layer operator-theoretic framework that (1) establishes unconditional guarantees for the diffusion subsystem, including spectral non-expansiveness and monotone Dirichlet-energy dissipation when the diffusion step size is smaller than one half, (2) derives compositional perturbation bounds linking insertion effects to representation roughness and downstream amplification, and ...