
[2602.17894] Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

arXiv - Machine Learning, February 23, 2026

Summary

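This paper studies how to collect data from multiple biased and costly sources under a fixed budget, with the goal of estimating population means and group-conditional means for a target population. It proposes a sampling plan that maximizes the effective sample size rather than trying to mirror the target distribution.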

Why It Matters

Understanding how to effectively collect and utilize data from diverse sources is crucial in fields like healthcare and social sciences, where data quality and cost are significant factors. This research provides a framework for improving data collection strategies, which can lead to more accurate models and better decision-making.

Key Takeaways

Naive data collection strategies, such as trying to match the target distribution, and standard estimators, such as the sample mean, can be highly suboptimal.
The proposed sampling plan maximizes the effective sample size under a fixed budget.
The approach achieves minimax-optimal risk for the estimation problems studied.
Post-stratification estimators can be effectively paired with the new sampling plan.
The techniques are applicable to various multi-source learning scenarios.
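
To make the post-stratification takeaway concrete, here is a minimal Python sketch of the textbook post-stratified mean compared with the plain sample mean. It is not the paper's sampling plan or its exact estimator. The function names, the toy data, and the assumption that the target group proportions are known are all illustrative choices made for this example.

```python
import numpy as np


def sample_mean(values):
    """Plain average of all observations, ignoring group membership."""
    return float(np.mean(values))


def post_stratified_mean(values, groups, target_props):
    """Textbook post-stratified mean (illustrative, not the paper's estimator).

    Averages observations within each group, then reweights the group
    means by the group's share in the target population.
    `target_props` maps group label -> target proportion and is assumed known.
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    estimate = 0.0
    for group, proportion in target_props.items():
        mask = groups == group
        if not mask.any():
            raise ValueError(f"no observations for group {group!r}")
        estimate += proportion * values[mask].mean()
    return float(estimate)


# Toy data: the source over-represents group "a" (3 of 4 observations),
# while the hypothetical target population is split evenly.
values = [1.0, 1.0, 1.0, 3.0]
groups = ["a", "a", "a", "b"]
target_props = {"a": 0.5, "b": 0.5}

naive = sample_mean(values)
adjusted = post_stratified_mean(values, groups, target_props)
print(naive, adjusted)
```

In this toy case the sample mean is pulled toward the over-represented group, while the post-stratified estimate reweights each group mean to the target composition. How many samples to buy from each source under a budget is the question the paper's sampling plan addresses, and this sketch does not attempt to answer it.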

Statistics > Machine Learning

arXiv:2602.17894 (stat)

[Submitted on 19 Feb 2026]

Title: Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

Authors: Michael O. Harding, Vikas Singh, Kirthevasan Kandasamy

Abstract: Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations) and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to "match" the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size: the total sample size divided by $D_{\chi^2}...
