[2510.03272] Where to Add PDE Diffusion in Transformers

arXiv - AI · 4 min read

Summary

This paper investigates where PDE diffusion layers should be placed in transformer architectures, showing that their insertion order relative to attention significantly affects model performance.

Why It Matters

Understanding where to integrate PDE diffusion in transformers is crucial for enhancing model accuracy and efficiency. This research provides a theoretical framework that can guide future developments in hybrid architectures, potentially leading to more effective machine learning models.

Key Takeaways

  • Diffusion and attention in transformers do not commute, affecting performance based on their insertion order.
  • Early diffusion, inserted immediately after the embedding layer, improves accuracy by 4.1 percentage points.
  • Post-attention diffusion can degrade performance by 2.5 percentage points.
  • A multi-scale diffusion variant shows consistent performance gains.
  • The study offers a framework for analyzing local-global compositions in sequence models.
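The first takeaway, that diffusion and attention do not commute, is easy to check numerically. The sketch below is not the paper's code; it uses a single-head softmax attention with identity Q/K/V projections purely for illustration, and applies a Neumann-Laplacian diffusion step before versus after that attention layer:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 6, 4, 0.25          # tokens, feature dim, diffusion step (< 1/2)
X = rng.normal(size=(n, d))

# Diffusion along the sequence axis: D = I + tau * L (Neumann Laplacian)
L = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = -1.0      # zero-flux (Neumann) boundaries
D = np.eye(n) + tau * L

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):
    # Single-head self-attention with identity projections (illustrative only)
    return softmax(X @ X.T / np.sqrt(X.shape[1])) @ X

pre = attention(D @ X)          # diffuse, then attend
post = D @ attention(X)         # attend, then diffuse
gap = np.linalg.norm(pre - post)
print(gap)
```

Even though the diffusion operator itself is non-expansive for step sizes below one half, the attention map is nonlinear, so the two orderings produce measurably different outputs, which is exactly why insertion position matters.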

Computer Science > Machine Learning
arXiv:2510.03272 (cs)
[Submitted on 27 Sep 2025 (v1), last revised 16 Feb 2026 (this version, v3)]

Title: Where to Add PDE Diffusion in Transformers
Authors: Yukun Zhang, Xueqing Zhou

Abstract: Transformers enable powerful content-based global routing via self-attention, but they lack an explicit local geometric prior along the sequence axis. As a result, the placement of locality-inducing modules in hybrid architectures has largely been empirical. We study a simple deterministic PDE diffusion layer, implemented as one explicit Euler step of one-dimensional heat smoothing using a discrete Neumann Laplacian under a spectral stability constraint, and ask a structural question: where should diffusion be inserted relative to attention? Our central claim is that diffusion and attention generally do not commute, so inserting the same local operator before versus after attention leads to qualitatively different behaviors. We develop a three-layer operator-theoretic framework that (1) establishes unconditional guarantees for the diffusion subsystem, including spectral non-expansiveness and monotone Dirichlet-energy dissipation when the diffusion step size is smaller than one half, (2) derives compositional perturbation bounds linking insertion effects to representation roughness and downstream amplification, and ...
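The diffusion layer described in the abstract is small enough to write down. Below is a minimal sketch, reconstructed from the abstract's description rather than taken from the authors' code, of one explicit Euler step of one-dimensional heat smoothing with a discrete Neumann Laplacian; the step size tau < 1/2 is the spectral stability constraint the first guarantee refers to:

```python
import numpy as np

def neumann_laplacian(n):
    """Discrete 1D Laplacian with zero-flux (Neumann) boundaries."""
    L = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    L[0, 0] = L[-1, -1] = -1.0   # reflect the missing neighbor at each end
    return L

def diffusion_step(x, tau=0.25):
    """One explicit Euler step of heat smoothing: x <- x + tau * L @ x.

    Stability (spectral non-expansiveness) requires tau < 1/2.
    """
    return x + tau * neumann_laplacian(len(x)) @ x

x = np.array([0.0, 0.0, 4.0, 0.0, 0.0])   # a spike to be smoothed
y = diffusion_step(x)                      # -> [0., 1., 2., 1., 0.]

# Neumann boundaries conserve total mass, and the Dirichlet energy
# (sum of squared neighbor differences) monotonically decreases.
dirichlet = lambda v: np.sum(np.diff(v) ** 2)
```

Applied per feature channel, this operator acts purely along the sequence axis, which is what makes its placement relative to the content-based attention map a meaningful design choice.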

Related Articles

  • [2604.01989] Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation (arXiv - AI · 4 min)
  • [2604.01447] Better Rigs, Not Bigger Networks: A Body Model Ablation for Gaussian Avatars (arXiv - AI · 3 min)
  • [2603.24326] Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing (arXiv - AI · 4 min)
  • [2603.18545] CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models (arXiv - AI · 4 min)