
[2503.07313] The influence of missing data mechanisms and simple missing data handling techniques on fairness

arXiv - Machine Learning February 23, 2026 4 min read Article

Summary

This article explores how different missing data mechanisms and handling techniques affect the fairness of machine learning algorithms, revealing that listwise deletion generally yields the highest fairness across various classification methods.

Why It Matters

Understanding the impact of missing data on algorithmic fairness is crucial as machine learning systems increasingly influence decision-making in various sectors. This research highlights the importance of selecting appropriate data handling techniques to mitigate bias and enhance fairness in AI applications.

Key Takeaways

Missing data mechanisms can influence the fairness of machine learning algorithms.
Listwise deletion often provides the highest fairness among handling techniques.
Among the classification algorithms compared, random forests tend to achieve the highest fairness.
The interaction between data handling techniques and algorithms is significant.
Limited research exists on the implications of missing data on algorithmic fairness.
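
To make the handling techniques named in these takeaways concrete, here is a minimal sketch of listwise deletion and simple mean/mode imputation on a toy dataset. The records, column names, and helper functions (`listwise_deletion`, `simple_impute`) are illustrative assumptions, not taken from the paper, and the code uses only the Python standard library.

```python
from statistics import mean, mode

# Hypothetical toy records; None marks a missing value.
records = [
    {"group": "A", "income": 30.0, "status": "employed"},
    {"group": "A", "income": None, "status": "employed"},
    {"group": "B", "income": 50.0, "status": None},
    {"group": "B", "income": 40.0, "status": "unemployed"},
    {"group": "B", "income": None, "status": "employed"},
]


def listwise_deletion(rows):
    """Keep only rows that have no missing values in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]


def simple_impute(rows, numeric_cols, categorical_cols):
    """Fill numeric columns with the observed mean and categorical
    columns with the observed mode. Other columns are left unchanged."""
    fill = {}
    for col in numeric_cols:
        fill[col] = mean([r[col] for r in rows if r[col] is not None])
    for col in categorical_cols:
        fill[col] = mode([r[col] for r in rows if r[col] is not None])
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


complete = listwise_deletion(records)
imputed = simple_impute(records, ["income"], ["status"])

print(len(records), len(complete), len(imputed))
```

Listwise deletion shrinks the dataset and can change how many rows each group contributes, while imputation keeps every row but replaces missing entries with a single summary value. Either choice changes what a classifier is trained on, which is why the handling technique can matter for fairness.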

Statistics > Machine Learning
arXiv:2503.07313 (stat)
[Submitted on 10 Mar 2025 (v1), last revised 19 Feb 2026 (this version, v2)]

Title: The influence of missing data mechanisms and simple missing data handling techniques on fairness

Authors: Aeysha Bhatti, Trudie Sandrock, Johane Nienkemper-Swanepoel

Abstract: Machine learning algorithms permeate the day-to-day aspects of our lives and therefore studying the fairness of these algorithms before implementation is crucial. One way in which bias can manifest in a dataset is through missing values. Missing data are often assumed to be missing completely randomly; in reality the propensity of data being missing is often tied to the demographic characteristics of individuals. There is limited research into how missing values and the handling thereof can impact the fairness of an algorithm. Most researchers either apply listwise deletion or tend to use simpler methods of imputation (e.g. mean or mode) compared to more advanced approaches (e.g. multiple imputation). This study considers the fairness of various classification algorithms after a range of missing data handling strategies is applied. Missing values are generated (i.e. amputed) in three popular datasets for classification fairness, by creating a high percentage of missing values ...
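
The abstract contrasts data that are missing completely at random with data whose missingness depends on demographic characteristics. The sketch below amputes a toy dataset under both assumptions and compares per-group missing rates. The `ampute` and `missing_rate_by_group` helpers, the group labels, and the probabilities (0.3, 0.1, 0.5) are hypothetical choices for illustration and do not reproduce the paper's amputation procedure.

```python
import random


def ampute(rows, column, prob_missing, rng):
    """Return a copy of rows in which `column` is set to None with a
    row-specific probability given by prob_missing(row)."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input data is not modified
        if rng.random() < prob_missing(row):
            row[column] = None
        out.append(row)
    return out


def missing_rate_by_group(rows, column):
    """Fraction of rows per group whose `column` value is missing."""
    counts, missing = {}, {}
    for row in rows:
        g = row["group"]
        counts[g] = counts.get(g, 0) + 1
        missing[g] = missing.get(g, 0) + (row[column] is None)
    return {g: missing[g] / counts[g] for g in counts}


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: two equally sized groups, one numeric feature.
data = [
    {"group": "A" if i % 2 == 0 else "B", "income": float(i)}
    for i in range(2000)
]

# Missing completely at random: same probability for every row.
mcar = ampute(data, "income", lambda r: 0.3, rng)

# Group-dependent missingness: group B loses values far more often.
group_dependent = ampute(
    data, "income", lambda r: 0.1 if r["group"] == "A" else 0.5, rng
)

mcar_rates = missing_rate_by_group(mcar, "income")
dep_rates = missing_rate_by_group(group_dependent, "income")
print("MCAR:", mcar_rates)
print("Group-dependent:", dep_rates)
```

Under the first mechanism both groups lose roughly the same share of values. Under the second, listwise deletion would discard far more rows from group B than from group A, which is one way a missing data mechanism can interact with a handling technique.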
