[2506.14202] DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
Summary
The paper introduces DiffusionBlocks, a framework for block-wise training of neural networks that reduces memory bottlenecks while maintaining competitive performance with end-to-end training.
Why It Matters
As neural networks scale, the memory required to store activations for end-to-end backpropagation becomes a training bottleneck. DiffusionBlocks offers a scalable alternative that trains network blocks independently, reducing activation memory while remaining effective across a range of transformer architectures.
Key Takeaways
- DiffusionBlocks allows independent training of neural network blocks.
- The framework reduces memory requirements proportional to the number of blocks.
- It maintains performance comparable to end-to-end training methods.
- Applicable to various transformer architectures beyond classification tasks.
- The approach is theoretically grounded and supports modern generative tasks.
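The memory claim in the takeaways follows from simple arithmetic: end-to-end backpropagation caches activations for every layer, while block-wise training only caches activations for the one block currently receiving gradients. A back-of-the-envelope sketch (the layer counts and byte sizes below are hypothetical, not figures from the paper):

```python
# Hypothetical activation-memory comparison: end-to-end vs. block-wise training.
# All numbers are illustrative; they are not measurements from the paper.

def end_to_end_activation_memory(num_layers: int, bytes_per_layer: int) -> int:
    """Backprop through the whole network caches every layer's activations."""
    return num_layers * bytes_per_layer

def blockwise_activation_memory(num_layers: int, num_blocks: int,
                                bytes_per_layer: int) -> int:
    """Only the block currently being trained caches its activations."""
    layers_per_block = num_layers // num_blocks
    return layers_per_block * bytes_per_layer

layers, blocks, per_layer = 48, 8, 512 * 1024**2  # 48 layers, 8 blocks, 512 MiB/layer
full = end_to_end_activation_memory(layers, per_layer)
block = blockwise_activation_memory(layers, blocks, per_layer)
print(full // block)  # → 8: savings factor equals the number of blocks
```

This is where the "memory requirements proportional to the number of blocks" takeaway comes from: with B equal-sized blocks, activation memory drops by a factor of B.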
Computer Science > Machine Learning
arXiv:2506.14202 (cs)
[Submitted on 17 Jun 2025 (v1), last revised 18 Feb 2026 (this version, v3)]
Title: DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
Authors: Makoto Shing, Masanori Koyama, Takuya Akiba
Abstract: End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose $\textit{DiffusionBlocks}$, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range ...
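The mechanism described in the abstract can be sketched as follows: each block is assigned its own noise level and trained in isolation with a denoising objective, so gradients exist for only one block's parameters at a time. The linear "blocks", the noise schedule, and the noise-prediction loss below are deliberate simplifications for illustration, not the paper's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of block-wise denoising training: each "block" (here a single
# linear map) is assigned a noise level sigma and trained independently to
# predict the noise injected at that level. Only the current block's
# parameters receive gradient updates. Hypothetical setup, not the paper's.
dim, steps, lr = 16, 200, 0.05
sigmas = [0.5, 1.0, 1.5]              # hypothetical per-block noise levels
x0 = rng.standard_normal((256, dim))  # "clean" training data

results = []
for sigma in sigmas:
    W = np.zeros((dim, dim))                      # this block's parameters only
    for _ in range(steps):
        eps = rng.standard_normal(x0.shape)
        x_noisy = x0 + sigma * eps                # corrupt data at this level
        pred = x_noisy @ W                        # block predicts injected noise
        grad = 2.0 / len(x0) * x_noisy.T @ (pred - eps)
        W -= lr * grad                            # update this block alone
    baseline = np.mean(eps**2)                    # loss of the untrained block
    trained = np.mean((x_noisy @ W - eps) ** 2)
    results.append(trained < baseline)

print(all(results))  # each independently trained block beats its untrained loss
```

The point of the sketch is the training loop's shape: at any moment only one block's parameters and activations are live, which is what yields the memory savings the abstract describes; the real method applies this to transformer blocks via their residual-connection dynamics.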