[2510.19675] Study of Training Dynamics for Memory-Constrained Fine-Tuning
Summary
This study presents TraDy, a novel transfer learning scheme for memory-constrained fine-tuning of deep neural networks, achieving state-of-the-art performance while maintaining strict resource limits.
Why It Matters
As deep learning models grow in size, efficient training methods become crucial for deployment in resource-limited environments. This research shows that carefully choosing which layers and channels to update can sharply cut memory and compute costs without sacrificing accuracy, making on-device fine-tuning more accessible and practical.
Key Takeaways
- TraDy leverages layer importance and dynamic stochastic channel selection for efficient training.
- Achieves up to 99% activation sparsity, 95% weight-derivative sparsity, and a 97% reduction in FLOPs for weight-derivative computation.
- Demonstrates state-of-the-art performance across various tasks and architectures.
Computer Science > Machine Learning
arXiv:2510.19675 (cs)
[Submitted on 22 Oct 2025 (v1), last revised 20 Feb 2026 (this version, v2)]
Title: Study of Training Dynamics for Memory-Constrained Fine-Tuning
Authors: Aël Quélennec, Nour Hezbri, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione
Abstract: Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2510.19675 [cs.LG] (or arXiv:2510.19675v2 [cs.LG] for this ...
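The core mechanic described in the abstract, resampling a random subset of channels between epochs within preselected layers and updating only those, can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the layer sizes, `keep_ratio`, learning rate, and helper names (`select_channels`, `masked_update`) are all hypothetical, and real gradients are replaced by a constant placeholder.

```python
import random

def select_channels(num_channels, keep_ratio, rng):
    """Stochastically resample which channels are updated this epoch."""
    k = max(1, int(num_channels * keep_ratio))
    return set(rng.sample(range(num_channels), k))

def masked_update(weights, grads, active, lr):
    """Gradient step on active channels only; inactive channels stay frozen,
    so their activations and weight derivatives never need to be stored."""
    for c in active:
        weights[c] = [w - lr * g for w, g in zip(weights[c], grads[c])]
    return weights

# Toy "layer" in a preselected set: 8 output channels of width 4.
rng = random.Random(0)
weights = [[0.0] * 4 for _ in range(8)]
grads = [[1.0] * 4 for _ in range(8)]   # placeholder gradients

for epoch in range(3):
    # Dynamic selection: a fresh random channel subset each epoch,
    # in contrast to a static mask fixed once before training.
    active = select_channels(8, keep_ratio=0.25, rng=rng)
    weights = masked_update(weights, grads, active, lr=0.1)
```

With `keep_ratio=0.25`, only 2 of 8 channels are touched per epoch, so both the stored activations and the weight-derivative FLOPs for this layer shrink by 75% in this sketch; the resampling between epochs is what lets the sparse updates approximate the full gradient over time.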