[2501.15889] Adaptive Width Neural Networks
Summary
The paper introduces Adaptive Width Neural Networks, a technique that learns the width of each neural network layer jointly with its parameters during training, improving performance across diverse data domains.
Why It Matters
This research addresses the limitations of traditional neural network width selection methods, offering a more efficient way to adaptively manage network complexity. This is particularly relevant in the context of large-scale models where hyperparameter tuning is often impractical due to high costs.
Key Takeaways
- Adaptive Width Neural Networks optimize layer width during training.
- The method allows for dynamic adjustment based on task difficulty.
- It provides a cost-effective way to manage network performance and resource use.
- Applicable across diverse data types including images, text, and graphs.
- Offers a viable alternative to traditional hyperparameter tuning methods.
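The core idea of learning width during training can be pictured as a soft, learnable mask over a layer's neurons, where a scalar parameter controls how many units are effectively "on" and can be updated by gradient descent alongside the weights. The sketch below is a minimal illustration of that general mechanism, not the paper's exact parameterization; the sigmoid mask and all names here are illustrative assumptions.

```python
import math

def soft_width_mask(theta, max_width):
    # Monotonically decaying importance per neuron: unit i is "on"
    # roughly when i < theta. Because theta is a plain scalar, it can
    # be learned by backpropagation together with the layer weights,
    # letting the effective width grow or shrink during training.
    # (Illustrative mask; not the paper's exact formulation.)
    return [1.0 / (1.0 + math.exp(i - theta)) for i in range(max_width)]

def masked_forward(x, weights, mask):
    # Each hidden unit's pre-activation is scaled by its mask value,
    # so low-importance units contribute almost nothing to the output.
    return [m * sum(w * xi for w, xi in zip(row, x))
            for row, m in zip(weights, mask)]

mask = soft_width_mask(theta=3.0, max_width=8)
effective_width = sum(mask)  # differentiable proxy for the layer's width
```

Since the mask decays with the unit index, the units are implicitly ordered by importance, which is what later makes cheap truncation possible.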
Computer Science > Machine Learning
arXiv:2501.15889 (cs)
[Submitted on 27 Jan 2025 (v1), last revised 16 Feb 2026 (this version, v5)]
Title: Adaptive Width Neural Networks
Authors: Federico Errica, Henrik Christiansen, Viktor Zaverkin, Mathias Niepert, Francesco Alesiani
Abstract: For almost 70 years, researchers have typically selected the width of neural networks' layers either manually or through automated hyperparameter tuning methods such as grid search and, more recently, neural architecture search. This paper challenges the status quo by introducing an easy-to-use technique to learn an unbounded width of a neural network's layer during training. The method jointly optimizes the width and the parameters of each layer via standard backpropagation. We apply the technique to a broad range of data domains such as tables, images, text, sequences, and graphs, showing how the width adapts to the task's difficulty. A by-product of our width-learning approach is the easy truncation of the trained network at virtually zero cost, achieving a smooth trade-off between performance and compute resources. Alternatively, one can dynamically compress the network as long as performance does not degrade. In light of recent foundation models trained on large datasets, requiring billions of parameters and where hyperparameter tuning is unfeasible due to huge training...
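The abstract's "truncation at virtually zero cost" follows from units being scaled by learned importance scores: once training is done, units whose score is negligible can simply be dropped, changing the output only marginally. The following is a self-contained sketch of that trade-off under the assumption of a decaying importance score per unit; the numbers and names are illustrative, not from the paper.

```python
def layer_output(x, weights, scores):
    # Sum of hidden-unit activations, each scaled by its importance
    # score (a stand-in for the learned soft width mask).
    return sum(s * sum(w * xi for w, xi in zip(row, x))
               for row, s in zip(weights, scores))

# Illustrative trained state: importance decays with the unit index.
x = [0.5, -1.0, 2.0]
weights = [[0.2, -0.1, 0.4], [0.3, 0.0, -0.2],
           [-0.5, 0.1, 0.1], [0.4, 0.2, -0.3]]
scores = [0.98, 0.90, 0.04, 0.01]  # later units are nearly switched off

full = layer_output(x, weights, scores)

# Truncation: physically drop units whose score falls below a threshold.
# No retraining is needed, hence the "virtually zero cost".
eps = 0.1
kept = [(row, s) for row, s in zip(weights, scores) if s >= eps]
truncated = sum(s * sum(w * xi for w, xi in zip(row, x))
                for row, s in kept)
```

Sweeping `eps` from 0 upward yields the smooth performance-compute trade-off the abstract describes: larger thresholds remove more units at a gradually increasing cost in output fidelity.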