[2602.17894] Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget
Summary
This paper studies how to collect data from multiple biased and costly sources under a fixed budget in order to estimate population means and group-conditional means. It develops a sampling plan that maximizes the effective sample size.
Why It Matters
Understanding how to effectively collect and utilize data from diverse sources is crucial in fields like healthcare and social sciences, where data quality and cost are significant factors. This research provides a framework for improving data collection strategies, which can lead to more accurate models and better decision-making.
Key Takeaways
- Naive data collection strategies (e.g. attempting to "match" the target distribution) and standard estimators (e.g. the sample mean) can be highly suboptimal.
- The proposed sampling plan maximizes the effective sample size under a fixed budget.
- The approach achieves minimax optimal risk, enhancing prediction accuracy.
- Post-stratification estimators can be effectively paired with the new sampling plan.
- The techniques are applicable to various multi-source learning scenarios.
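As a minimal sketch of the post-stratification idea mentioned in the takeaways (not the paper's own estimator or sampling plan), the Python snippet below contrasts the plain sample mean with a post-stratified mean when a source over-represents one group. The function names, the toy data, and the assumption that the target group proportions are known exactly are all illustrative.

```python
from collections import defaultdict


def sample_mean(samples):
    """Plain average of the observed values, ignoring group labels."""
    values = [y for _, y in samples]
    return sum(values) / len(values)


def post_stratified_mean(samples, target_props):
    """Reweight per-group sample means by (assumed known) target proportions.

    samples: iterable of (group, value) pairs drawn from a possibly biased source.
    target_props: dict mapping group -> share of that group in the target population.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, y in samples:
        sums[group] += y
        counts[group] += 1

    missing = [g for g in target_props if counts[g] == 0]
    if missing:
        # A group with no observations has no sample mean to reweight.
        raise ValueError(f"no observations for groups: {missing}")

    return sum(p * sums[g] / counts[g] for g, p in target_props.items())


# Toy data (hypothetical): the source over-samples group "a" three to one,
# while the target population is split evenly between "a" and "b".
samples = [("a", 1.0), ("a", 1.0), ("a", 1.0), ("b", 3.0)]
target_props = {"a": 0.5, "b": 0.5}

naive = sample_mean(samples)                                # 1.5, pulled toward group "a"
adjusted = post_stratified_mean(samples, target_props)      # 2.0, matches the target mix
print(naive, adjusted)
```

In this toy case the sample mean inherits the source's group imbalance, while the post-stratified estimate weights each group by its share of the target population. How many samples to draw from each source under a budget is a separate question that this sketch does not address.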
Statistics > Machine Learning
arXiv:2602.17894 (stat)
[Submitted on 19 Feb 2026]
Title: Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget
Authors: Michael O. Harding, Vikas Singh, Kirthevasan Kandasamy
Abstract: Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations) and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to "match" the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size: the total sample size divided by $D_{\chi^2}...