[2602.17894] Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

arXiv - Machine Learning · 4 min read

Summary

This paper studies how to collect data from biased and costly sources under a fixed budget, developing a sampling plan that maximizes the effective sample size for estimating population means and group-conditional means.

Why It Matters

Understanding how to effectively collect and utilize data from diverse sources is crucial in fields like healthcare and social sciences, where data quality and cost are significant factors. This research provides a framework for improving data collection strategies, which can lead to more accurate models and better decision-making.

Key Takeaways

  • Naive data collection strategies (e.g. trying to "match" the target distribution) can be highly suboptimal.
  • The proposed sampling plan maximizes the effective sample size subject to the budget constraint.
  • The approach achieves minimax-optimal risk, improving estimation accuracy.
  • Post-stratification estimators pair effectively with the new sampling plan (see the sketch after this list).
  • The techniques apply to a variety of multi-source learning scenarios.
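
As a rough illustration of the post-stratification idea mentioned above (a minimal sketch, not necessarily the authors' exact estimator), the NumPy snippet below estimates each group's conditional mean from the collected sample and then reweights those means by the target population's group proportions. The data, group labels, and 50/50 target split are all hypothetical.

    import numpy as np

    def post_stratified_mean(values, groups, target_props):
        # Post-stratified estimate of a target-population mean:
        # average within each group, then reweight the group means
        # by the target population's group proportions.
        estimate = 0.0
        for g, p in target_props.items():
            mask = groups == g
            if not mask.any():
                raise ValueError(f"no samples for group {g!r}")
            estimate += p * values[mask].mean()
        return estimate

    # Hypothetical example: the collected sample over-represents
    # group "a" (80 of 100 draws), but the target population is
    # an even 50/50 split across the two groups.
    rng = np.random.default_rng(0)
    groups = np.array(["a"] * 80 + ["b"] * 20)
    values = np.where(groups == "a",
                      rng.normal(1.0, 1.0, size=100),
                      rng.normal(3.0, 1.0, size=100))
    print(post_stratified_mean(values, groups, {"a": 0.5, "b": 0.5}))
    # The naive sample mean is pulled toward the over-sampled group:
    print(values.mean())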

Statistics > Machine Learning
arXiv:2602.17894 (stat) [Submitted on 19 Feb 2026]

Title: Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget
Authors: Michael O. Harding, Vikas Singh, Kirthevasan Kandasamy

Abstract: Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations), and the relative composition of these groups may differ substantially, both among the source populations and between the sources and the target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to "match" the target distribution) or relying on standard estimators (e.g. the sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size: the total sample size divided by $D_{\chi^2}$...
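
The abstract is cut off mid-definition, but in importance-sampling settings the effective sample size is commonly written as the raw sample size divided by one plus the chi-square divergence between the target distribution and the realized sampling mixture. The sketch below assumes that common form; the paper's exact definition may differ, and all numbers are hypothetical.

    import numpy as np

    def chi2_divergence(p, q):
        # Chi-square divergence D_chi2(p || q) between two discrete
        # distributions over the same groups (arrays summing to 1).
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.sum((p - q) ** 2 / q)

    def effective_sample_size(n, target, mixture):
        # Raw sample size n, discounted by the chi-square mismatch
        # between the target distribution and the sampling mixture.
        return n / (1.0 + chi2_divergence(target, mixture))

    # Hypothetical two-group example: the target is a 50/50 split,
    # but the affordable sources deliver an 80/20 mixture.
    target = [0.5, 0.5]
    skewed = [0.8, 0.2]
    print(effective_sample_size(1000, target, target))  # 1000.0 (no mismatch)
    print(effective_sample_size(1000, target, skewed))  # 640.0 (bias wastes samples)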

Related Articles

Machine Learning

[D] ICML Rebuttal Question

I am currently working on my response to the rebuttal acknowledgments for ICML and I am doubting how to handle the strawman argument of that...

Reddit - Machine Learning · 1 min
Machine Learning

[D] ML researcher looking to switch to a product company.

Hey, I am an AI researcher currently working in a deep tech company as a data scientist. Prior to this, I was doing my PhD. My current ro...

Reddit - Machine Learning · 1 min
Machine Learning

Building behavioural response models of public figures using Brain scan data (Predict their next move using psychological modelling) [P]

Hey guys, I’m the same creator of Netryx V2, the geolocation tool. I’ve been working on something new called COGNEX. It learns how a pers...

Reddit - Machine Learning · 1 min
Machine Learning

[P] bitnet-edge: Ternary-weight CNNs ({-1,0,+1}) on MNIST and CIFAR-10, deployed to ESP32-S3 with zero multiplications

I built a pipeline that takes ternary-quantized CNNs from PyTorch training all the way to bare-metal inference on an ESP32-S3 microcontro...

Reddit - Machine Learning · 1 min