[2505.22554] A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning
Summary
This article presents a copula-based supervised filter for feature selection in diabetes risk prediction. The filter ranks features by a Gumbel-copula upper-tail concordance score and is reported to improve the efficiency and interpretability of the resulting machine learning models.
Why It Matters
Effective feature selection is crucial in medical predictive modeling, particularly for diabetes risk. This study introduces a method that enhances the identification of significant predictors, especially in extreme patient strata, which can lead to better clinical outcomes and more accurate risk assessments.
Key Takeaways
- Introduces a copula-based filter that ranks features by a Gumbel-copula upper-tail concordance score (lambda U), a monotone transformation of Kendall's tau.
- Demonstrates improved efficiency by reducing features while maintaining predictive power.
- Targets predictors whose relevance is concentrated in the distribution tails, which selectors based on average association can miss.
- Compares favorably against four common baselines (Mutual Information, mRMR, ReliefF, and L1/Elastic-Net) across four classifiers and two diabetes datasets.
- Provides a clinically coherent approach that can complement existing methods in public health.
Statistics > Machine Learning
arXiv:2505.22554 (stat)
[Submitted on 28 May 2025 (v1), last revised 24 Feb 2026 (this version, v5)]
Title: A Copula Based Supervised Filter for Feature Selection in Diabetes Risk Prediction Using Machine Learning
Authors: Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux
Abstract: Effective feature selection is critical for robust and interpretable predictive modeling in medicine, especially when risk factors matter most in extreme patient strata. Many standard selectors emphasize average associations and can miss predictors whose relevance is concentrated in the distribution tails. We propose a computationally efficient supervised filter based on a Gumbel-copula implied upper-tail concordance score (lambda U), defined as a monotone transformation of Kendall's tau, to rank features by their tendency to be simultaneously extreme with the positive class. We compare against four common baselines (Mutual Information, mRMR, ReliefF, and L1/Elastic-Net) across four classifiers on two diabetes datasets: a large-scale public health survey (CDC, N=253,680) and a clinical benchmark (PIMA, N=768). Analyses include statistical testing, permutation importance, and robustness checks. On CDC, the proposed selector is the fastest and reduces 21 f...
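The score described in the abstract can be illustrated with a minimal sketch. For a Gumbel copula with parameter theta, the standard relations are tau = 1 - 1/theta and lambda U = 2 - 2^(1/theta), which combine to lambda U = 2 - 2^(1 - tau). The code below applies that transformation to Kendall's tau between each feature and a binary target. Everything beyond that formula is an assumption made for illustration and is not taken from the paper: the tau-b tie correction, the clipping of negative tau to zero, the O(n^2) pair loop, the function names, and the toy features "glucose" and "noise".

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b via a plain O(n^2) pair loop (assumed tie handling)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # pairs tied in both variables carry no information
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom > 0 else 0.0


def gumbel_lambda_u(tau):
    """Gumbel-copula upper-tail dependence implied by Kendall's tau."""
    # Assumption: the Gumbel family only covers tau in [0, 1], so negative
    # associations are clipped to zero here.
    tau = min(max(tau, 0.0), 1.0)
    return 2.0 - 2.0 ** (1.0 - tau)


def rank_features(features, target):
    """Return (name, score) pairs sorted by descending lambda U."""
    scores = {
        name: gumbel_lambda_u(kendall_tau_b(column, target))
        for name, column in features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: "glucose" rises with the positive class, "noise" does not.
toy_target = [0, 0, 0, 1, 1, 1]
toy_features = {
    "noise": [3, 1, 2, 2, 3, 1],
    "glucose": [85, 90, 100, 140, 150, 160],
}
ranking = rank_features(toy_features, toy_target)
```

A filter built this way would keep the top-k entries of `ranking`. Because the transformation is monotone in tau, the ordering matches a ranking by clipped tau itself, and the lambda U scale only changes how the scores are read.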