[2602.16125] On the Power of Source Screening for Learning Shared Feature Extractors
Summary
This paper explores the effectiveness of source screening in learning shared feature extractors, demonstrating that minimax-optimal subspace estimation can be achieved by training only on a carefully selected subset of relevant data sources.
Why It Matters
Understanding which data sources to include in machine learning models is crucial for enhancing representation learning. This research provides insights into optimizing data selection, which can improve model performance and efficiency, particularly in scenarios with heterogeneous data sources.
Key Takeaways
- Source screening can significantly enhance the learning of shared feature extractors.
- Carefully selected subsets of data sources can achieve minimax optimality.
- The study formalizes the concept of informative subpopulations for better data selection.
- Algorithms and heuristics are developed for identifying effective data subsets.
- Empirical evaluations validate the proposed methods on both synthetic and real-world datasets.
Computer Science > Machine Learning
arXiv:2602.16125 (cs)
Submitted on 18 Feb 2026
Title: On the Power of Source Screening for Learning Shared Feature Extractors
Authors: Leo (Muxing) Wang, Connor Mclaughlin, Lili Su
Abstract: Learning with a shared representation is widely recognized as an effective way to separate commonalities from heterogeneity across various heterogeneous sources. Most existing work includes all related data sources by simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we dive further into the question of which data sources should be learned jointly, focusing on the traditionally deemed "good" collection of sources, in which individual sources have similar relevance and quality with respect to the true underlying common structure. Towards tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of data is discarded. We f...
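The abstract's linear setting, where every source's regression parameter lies in a shared low-dimensional subspace, can be illustrated with a small NumPy sketch. The screening rule below (scoring each source by its in-sample residual variance and keeping the cleanest subset before pooling) is a hypothetical heuristic chosen for illustration; it is not the paper's algorithm, and all dimensions and noise levels are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, K, n = 20, 3, 12, 200  # ambient dim, subspace dim, #sources, samples/source

# Shared low-dimensional structure: every source's parameter lies in col(B).
B, _ = np.linalg.qr(rng.standard_normal((d, r)))

# Generate K linear-regression sources; the last 4 are much noisier.
noise_sd = np.array([0.1] * 8 + [2.0] * 4)
data = []
for k in range(K):
    theta_k = B @ rng.standard_normal(r)  # theta_k = B w_k
    X = rng.standard_normal((n, d))
    y = X @ theta_k + noise_sd[k] * rng.standard_normal(n)
    data.append((X, y))

def estimate_subspace(indices):
    """Stack per-source least-squares estimates; keep top-r left singular vectors."""
    Theta = np.column_stack(
        [np.linalg.lstsq(data[k][0], data[k][1], rcond=None)[0] for k in indices]
    )
    U, _, _ = np.linalg.svd(Theta, full_matrices=False)
    return U[:, :r]

def subspace_error(B_hat):
    """Spectral-norm distance between projection matrices (a sin-theta distance)."""
    return np.linalg.norm(B_hat @ B_hat.T - B @ B.T, ord=2)

# Illustrative screening heuristic: score each source by its in-sample
# residual variance and keep the cleanest two-thirds.
scores = [np.mean((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
          for X, y in data]
keep = np.argsort(scores)[:8]

print("error, all sources:", subspace_error(estimate_subspace(range(K))))
print("error, screened:   ", subspace_error(estimate_subspace(keep)))
```

Under this toy model the screened estimate typically tracks the true subspace closely, since the discarded sources contribute mostly noise to the stacked estimate; this mirrors the abstract's claim only qualitatively, not its minimax guarantees.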