[2602.21462] Effects of Training Data Quality on Classifier Performance

arXiv - Machine Learning

Summary

This paper investigates how training data quality affects classifier performance in the context of metagenomic assembly, examining four classifiers (Bayes classifiers, neural nets, partition models, and random forests) under multiple data-degradation mechanisms.

Why It Matters

Understanding the impact of training data quality is crucial for improving machine learning models, especially in fields like genomics where data integrity can significantly influence outcomes. This research quantifies the risks of degraded training data and offers insights into classifier congruence (agreement among classifiers' decisions), which can guide future model training and evaluation.

Key Takeaways

  • Classifier performance is significantly affected by training data quality.
  • Degradation of training data leads to breakdown-like behavior across classifiers.
  • Congruence among classifiers increases as data quality decreases.
  • As training data move farther from the analysis data, classifier decisions degenerate and spatial heterogeneity emerges.
  • Insights can inform better practices in training data selection and model evaluation.
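The degradation-and-congruence experiment described in the takeaways can be sketched in code. This is a hypothetical illustration, not the paper's actual pipeline or data: it flips an increasing fraction of training labels on synthetic data, retrains four scikit-learn classifiers roughly analogous to the paper's families (a decision tree stands in for a partition model), and reports test accuracy alongside mean pairwise agreement between the classifiers' predictions.

```python
# Hypothetical sketch (not the paper's code): degrade training-label quality
# at increasing rates and track each classifier's accuracy along with the
# mean pairwise agreement ("congruence") between their predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

models = {
    "bayes": GaussianNB(),
    "net": MLPClassifier(max_iter=500, random_state=0),
    "tree": DecisionTreeClassifier(random_state=0),  # stand-in for a partition model
    "forest": RandomForestClassifier(random_state=0),
}

for noise in (0.0, 0.2, 0.4):
    # Degrade training quality: flip a random fraction of the training labels.
    flip = rng.random(len(y_tr)) < noise
    y_noisy = np.where(flip, 1 - y_tr, y_tr)

    preds = {name: m.fit(X_tr, y_noisy).predict(X_te) for name, m in models.items()}
    acc = {name: (p == y_te).mean() for name, p in preds.items()}

    # Congruence: mean agreement over all classifier pairs on the test set.
    names = list(preds)
    pairs = [(preds[a] == preds[b]).mean()
             for i, a in enumerate(names) for b in names[i + 1:]]

    summary = {k: round(v, 2) for k, v in acc.items()}
    print(f"noise={noise:.1f}  acc={summary}  mean agreement={np.mean(pairs):.2f}")
```

Under the paper's breakdown-like regime, one would expect accuracy to fall as the noise rate rises while pairwise agreement stays high or increases, since the classifiers start being wrong in the same way; whether this toy setup reproduces that pattern depends on the synthetic data.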

Computer Science > Machine Learning

arXiv:2602.21462 (cs) [Submitted on 25 Feb 2026]

Title: Effects of Training Data Quality on Classifier Performance
Authors: Alan F. Karr, Regina Ruane

Abstract: We describe extensive numerical experiments assessing and quantifying how classifier performance depends on the quality of the training data, a frequently neglected component of the analysis of classifiers. More specifically, in the scientific context of metagenomic assembly of short DNA reads into "contigs," we examine the effects of degrading the quality of the training data by multiple mechanisms, for four classifiers: Bayes classifiers, neural nets, partition models, and random forests. We investigate both individual behavior and congruence among the classifiers. We find breakdown-like behavior that holds for all four classifiers: as degradation increases, they move from being mostly correct to only coincidentally correct, because they are wrong in the same way. In the process, a picture of spatial heterogeneity emerges: as the training data move farther from the analysis data, classifier decisions degenerate, the boundary becomes less dense, and congruence increases.

Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
Cite as: arXiv:2602.21462 [cs.LG]
