[2602.21462] Effects of Training Data Quality on Classifier Performance
Summary
This paper investigates how training data quality affects the performance of four classifiers (Bayes classifiers, neural nets, partition models, and random forests) in the context of metagenomic assembly, showing how classifier behavior changes as the training data degrade.
Why It Matters
Understanding the impact of training data quality is crucial for improving machine learning models, especially in fields like genomics where data integrity can significantly influence outcomes. This research highlights the risks associated with poor data quality and offers insights into classifier congruence, which can guide future model training and evaluation.
Key Takeaways
- Classifier performance is significantly affected by training data quality.
- Degradation of training data leads to breakdown-like behavior across classifiers.
- Congruence among classifiers increases as data quality decreases.
- Spatial heterogeneity in data affects classifier decision-making.
- Insights can inform better practices in training data selection and model evaluation.
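The degradation-and-congruence idea in the takeaways can be sketched with a toy experiment. This is not the paper's metagenomic setup: the two stand-in classifiers, the synthetic 1-D data, and the label-flip degradation mechanism are all illustrative assumptions. The sketch trains both classifiers on progressively corrupted labels, then tracks test accuracy and pairwise agreement (a simple congruence measure) on clean held-out data.

```python
# Toy sketch (illustrative assumptions throughout, not the paper's experiment):
# degrade training labels at increasing rates and track each classifier's
# accuracy plus the pairwise agreement (congruence) between their predictions.
import random

random.seed(0)

def make_data(n):
    # Two Gaussian classes on the line: class 0 at mean 0.0, class 1 at mean 2.0.
    xs, ys = [], []
    for _ in range(n):
        y = random.randint(0, 1)
        xs.append(random.gauss(2.0 * y, 1.0))
        ys.append(y)
    return xs, ys

def corrupt(ys, rate):
    # One degradation mechanism: flip each training label with probability `rate`.
    return [1 - y if random.random() < rate else y for y in ys]

def nearest_centroid(xs, ys):
    # Stand-in for a parametric classifier: predict the closer class mean.
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1

def one_nn(xs, ys):
    # Stand-in for a flexible classifier: label of the nearest training point.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

x_train, y_train = make_data(2000)
x_test, y_test = make_data(500)

results = {}
for rate in (0.0, 0.2, 0.45):
    y_noisy = corrupt(y_train, rate)
    c1 = nearest_centroid(x_train, y_noisy)
    c2 = one_nn(x_train, y_noisy)
    p1 = [c1(x) for x in x_test]
    p2 = [c2(x) for x in x_test]
    acc1 = sum(a == b for a, b in zip(p1, y_test)) / len(y_test)
    acc2 = sum(a == b for a, b in zip(p2, y_test)) / len(y_test)
    agree = sum(a == b for a, b in zip(p1, p2)) / len(p1)
    results[rate] = (acc1, acc2, agree)
    print(f"flip rate {rate:.2f}: centroid acc {acc1:.2f}, "
          f"1-NN acc {acc2:.2f}, agreement {agree:.2f}")
```

The paper's actual experiments use richer degradation mechanisms and real metagenomic data; this sketch only shows the shape of the analysis, in which accuracy and inter-classifier agreement are plotted against the degradation level.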
arXiv:2602.21462 (cs) [Submitted on 25 Feb 2026]
Title: Effects of Training Data Quality on Classifier Performance
Authors: Alan F. Karr, Regina Ruane
Abstract: We describe extensive numerical experiments assessing and quantifying how classifier performance depends on the quality of the training data, a frequently neglected component of the analysis of classifiers. More specifically, in the scientific context of metagenomic assembly of short DNA reads into "contigs," we examine the effects of degrading the quality of the training data by multiple mechanisms, and for four classifiers -- Bayes classifiers, neural nets, partition models and random forests. We investigate both individual behavior and congruence among the classifiers. We find breakdown-like behavior that holds for all four classifiers, as degradation increases and they move from being mostly correct to only coincidentally correct, because they are wrong in the same way. In the process, a picture of spatial heterogeneity emerges: as the training data move farther from analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases.
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
Cite as: arXiv:2602.21462 [cs.LG]