[2602.16177] Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks
Summary
This paper introduces Conjugate Learning Theory, a framework built on convex conjugate duality for characterizing when deep neural networks are trainable and how well they generalize, supported by empirical validation.
Why It Matters
Understanding the mechanisms of trainability and generalization in deep neural networks is crucial for improving model performance and efficiency. This research provides a theoretical foundation that can guide future advancements in machine learning, particularly in optimizing neural network architectures and training processes.
Key Takeaways
- Introduces a framework for understanding practical learnability in neural networks.
- Establishes convergence theorems related to mini-batch stochastic gradient descent.
- Quantifies the impact of model architecture and batch size on optimization.
- Derives bounds on generalization error based on generalized conditional entropy.
- Validates theoretical predictions with extensive empirical experiments.
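The takeaways above tie SGD convergence to the gradient energy and the extreme eigenvalues of a structure matrix. As an illustrative sketch only (not the paper's algorithm, and using the data Gram matrix as a stand-in for the paper's structure matrix, which is not defined in this summary), the following NumPy example runs mini-batch SGD on a toy least-squares problem while tracking both quantities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear regression with n samples and d features.
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Extreme eigenvalues of the Gram matrix X^T X / n, used here as a
# stand-in for the paper's "structure matrix" (an assumption).
eigs = np.linalg.eigvalsh(X.T @ X / n)
lam_min, lam_max = float(eigs[0]), float(eigs[-1])

w = np.zeros(d)
lr, batch = 0.05, 32
for step in range(500):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ w - yb) / batch   # mini-batch gradient of empirical risk
    grad_energy = float(g @ g)         # "gradient energy" ||g||^2
    w -= lr * g

final_risk = float(np.mean((X @ w - y) ** 2) / 2)
```

In this convex toy problem the Gram matrix is well-conditioned (lam_min > 0), so SGD drives the empirical risk down to roughly the noise floor; the theory summarized above concerns the far harder non-convex DNN setting, where these spectral quantities must be controlled rather than assumed.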
Abstract
arXiv:2602.16177 (stat) — Statistics > Machine Learning — [Submitted on 18 Feb 2026]
Title: Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks
Authors: Binchuan Qi
In this work, we propose a notion of practical learnability grounded in finite sample settings, and develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. Building on this foundation, we demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound for the achievable empirical risk, theoretically demonstrating that data determines the fundamental limit of trainability. On the generalization front, we derive deterministic and probabilistic bounds on generalization error based on generalized conditional entropy measures. The former ...
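The abstract states its generalization bounds in terms of generalized conditional entropy, and its trainability lower bound says the data itself fixes an irreducible limit. As a hedged illustration (ordinary Shannon conditional entropy H(Y|X) on a discrete toy distribution, not the paper's generalized measure), one can compute this quantity from an empirical joint distribution:

```python
import numpy as np

def conditional_entropy(joint):
    """Shannon conditional entropy H(Y|X) in bits from a joint p(x, y) table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                      # normalize to a distribution
    px = joint.sum(axis=1, keepdims=True)            # marginal p(x)
    cond = np.where(joint > 0, joint / px, 0.0)      # conditional p(y|x)
    logs = np.where(cond > 0, np.log2(cond), 0.0)    # 0 * log 0 treated as 0
    return float(-(joint * logs).sum())

# Labels fully determined by the input: H(Y|X) = 0, so the data
# imposes no irreducible risk.
det = [[0.5, 0.0],
       [0.0, 0.5]]

# Labels independent of the input and uniform: H(Y|X) = 1 bit of
# irreducible uncertainty that no model can remove.
ind = [[0.25, 0.25],
       [0.25, 0.25]]
```

The two toy tables make the abstract's point concrete: when H(Y|X) = 0 the labels are learnable in principle, while a strictly positive H(Y|X) floors the achievable risk regardless of the model.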