[2507.21807] MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation
Summary
MIBoost is a novel gradient boosting algorithm that performs variable selection jointly across multiply imputed datasets, addressing the open problem of model selection in the presence of missing data.
Why It Matters
This research is significant as it provides a solution to the common problem of missing data in predictive modeling. By enhancing variable selection methods, MIBoost could improve the accuracy of predictions in various fields, making it a valuable tool for statisticians and data scientists dealing with incomplete datasets.
Key Takeaways
- MIBoost offers a unified variable-selection mechanism across multiple imputed datasets.
- The algorithm extends the unified-loss principle of recent LASSO and elastic-net approaches to component-wise gradient boosting.
- Simulation studies indicate MIBoost achieves comparable predictive performance to other advanced methods.
- Addressing missing data effectively can enhance model reliability and insights.
- The research contributes to ongoing discussions about optimal model selection techniques.
arXiv:2507.21807 [stat.ML] (Statistics > Machine Learning)
Submitted on 29 Jul 2025 (v1); last revised 23 Feb 2026 (this version, v5)
Title: MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation
Authors: Robert Kuchen
Abstract: Statistical learning methods for automated variable selection, such as LASSO, elastic nets, or gradient boosting, have become increasingly popular tools for building powerful prediction models. Yet, in practice, analyses are often complicated by missing data. The most widely used approach to address missingness is multiple imputation, which involves creating several completed datasets. However, there is an ongoing debate on how to perform model selection in the presence of multiple imputed datasets. Simple strategies, such as pooling models across datasets, have been shown to have suboptimal properties. Although more sophisticated methods exist, they are often difficult to implement and therefore not widely applied. In contrast, two recent approaches modify the regularization methods LASSO and elastic nets by defining a single loss function, resulting in a unified set of coefficients across imputations. Our key contribution is to extend this principle to the framework of component-wise gradient boosting by proposing MIBoost, a novel algorithm that employs a u...
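The abstract's core idea, selecting one base learner per iteration by minimizing a single loss aggregated over all imputed datasets, can be illustrated with a small sketch. This is not the paper's implementation: the function name `miboost_sketch`, the L2 loss, and the pooled least-squares update are illustrative assumptions chosen to show how component-wise boosting yields one unified set of coefficients across imputations.

```python
import numpy as np

def miboost_sketch(imputed_Xs, ys, n_iter=50, nu=0.1):
    """Hedged sketch of MIBoost-style component-wise L2 boosting.

    Each iteration picks the single predictor j whose unified least-squares
    fit minimizes the squared-error loss summed over all M imputed datasets,
    so one shared coefficient vector is selected for every imputation.
    (Illustrative only; the paper's exact update rule may differ.)

    imputed_Xs: list of M arrays of shape (n, p); ys: list of M arrays (n,).
    """
    M = len(imputed_Xs)
    p = imputed_Xs[0].shape[1]
    beta = np.zeros(p)  # unified coefficients across imputations
    residuals = [y.astype(float).copy() for y in ys]
    for _ in range(n_iter):
        best_j, best_c, best_loss = None, 0.0, np.inf
        for j in range(p):
            # One coefficient fitted from the pooled loss over all imputations
            num = sum(imputed_Xs[m][:, j] @ residuals[m] for m in range(M))
            den = sum(imputed_Xs[m][:, j] @ imputed_Xs[m][:, j] for m in range(M))
            c = num / den
            loss = sum(
                np.sum((residuals[m] - c * imputed_Xs[m][:, j]) ** 2)
                for m in range(M)
            )
            if loss < best_loss:
                best_j, best_c, best_loss = j, c, loss
        # Shrunken update of the winning component, applied to every imputation
        beta[best_j] += nu * best_c
        for m in range(M):
            residuals[m] -= nu * best_c * imputed_Xs[m][:, best_j]
    return beta
```

Because only the predictor with the lowest aggregated loss is updated each round, variables never selected keep a coefficient of exactly zero, which is what makes the procedure a variable-selection method rather than a plain fitting routine.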