[2602.20698] High-Dimensional Robust Mean Estimation with Untrusted Batches
Summary
This paper presents algorithms for high-dimensional mean estimation in collaborative settings where data may come from untrusted sources, addressing challenges posed by adversarial users.
Why It Matters
As data increasingly comes from diverse and potentially malicious sources, understanding robust mean estimation is crucial for ensuring accuracy in machine learning applications. This research provides insights into handling adversarial data, which is vital for developing resilient AI systems.
Key Takeaways
- The study introduces a double corruption model for mean estimation involving adversarial and heterogeneous data sources.
- Two Sum-of-Squares-based algorithms are proposed to address the challenges of high-dimensional data corruption.
- The algorithms achieve a minimax-optimal error rate, highlighting the balance between adversarial influence and statistical heterogeneity.
- The research emphasizes the importance of batch structure in mitigating the impact of adversarial users.
- Findings are relevant for applications in AI where data integrity is critical, such as in finance and healthcare.
arXiv:2602.20698 (cs), Computer Science > Machine Learning, submitted on 24 Feb 2026
Title: High-Dimensional Robust Mean Estimation with Untrusted Batches
Authors: Maryam Aliakbarpour, Vladimir Braverman, Yuhan Liu, Junze Yin
Abstract: We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$. In this environment, a learner seeks to recover the mean $\mu$ of a true distribution $P$ from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an $\varepsilon$-fraction of users are entirely adversarial, while the remaining "good" users provide data from distributions that are related to $P$, but deviate by a proximity parameter $\alpha$. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of $\sqrt{\alpha}$, or (2) an $\alpha$-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even ...
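The double corruption setup described in the abstract can be made concrete with a small simulation. The sketch below is illustrative only and is not the paper's Sum-of-Squares algorithm. It assumes that $P$ is a Gaussian with identity covariance, that the mean-shift has Euclidean norm exactly $\sqrt{\alpha}$, and that the adversary places all of its points at one far-away location; the function name `simulate_untrusted_batches` and the `outlier` location are hypothetical. As a simple baseline (again, not the paper's method), it compares the naive average with the coordinate-wise median of batch means.

```python
import numpy as np

def simulate_untrusted_batches(N, n, d, eps, alpha, mu,
                               variant="mean_shift", seed=0):
    """Generate N batches of n samples in d dimensions under a toy
    version of the double corruption model (all choices are assumptions)."""
    rng = np.random.default_rng(seed)

    # An eps-fraction of users is entirely adversarial.
    num_bad = int(np.floor(eps * N))
    is_adversarial = np.zeros(N, dtype=bool)
    is_adversarial[rng.choice(N, size=num_bad, replace=False)] = True

    # Hypothetical adversarial strategy: every bad point sits at mu + 10.
    outlier = mu + 10.0
    batches = np.empty((N, n, d))

    for i in range(N):
        if is_adversarial[i]:
            batches[i] = outlier
        elif variant == "mean_shift":
            # Variant (1): good batch mean is shifted by sqrt(alpha)
            # in a random direction (norm choice is an assumption).
            direction = rng.standard_normal(d)
            direction /= np.linalg.norm(direction)
            shift = np.sqrt(alpha) * direction
            batches[i] = mu + shift + rng.standard_normal((n, d))
        elif variant == "sample_corruption":
            # Variant (2): an alpha-fraction of samples inside each
            # good batch is replaced by adversarial points.
            batches[i] = mu + rng.standard_normal((n, d))
            k = int(np.floor(alpha * n))
            idx = rng.choice(n, size=k, replace=False)
            batches[i, idx] = outlier
        else:
            raise ValueError(f"unknown variant: {variant}")

    return batches, is_adversarial

# Example usage with arbitrary illustrative parameters.
mu = np.zeros(5)
batches, is_adv = simulate_untrusted_batches(
    N=200, n=50, d=5, eps=0.1, alpha=0.04, mu=mu
)

batch_means = batches.mean(axis=1)             # shape (N, d)
naive_estimate = batch_means.mean(axis=0)      # average over all users
median_estimate = np.median(batch_means, axis=0)  # simple robust baseline

naive_error = np.linalg.norm(naive_estimate - mu)
median_error = np.linalg.norm(median_estimate - mu)
print("naive error:", naive_error)
print("coordinate-wise median error:", median_error)
```

Passing `variant="sample_corruption"` switches to the second deviation model. The median baseline is used here only because it is easy to state; it carries no optimality guarantee in high dimensions, which is the gap the paper's algorithms are meant to address.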