[2602.13684] On the Sparsifiability of Correlation Clustering: Approximation Guarantees under Edge Sampling
Summary
This paper explores the sparsifiability of correlation clustering, providing approximation guarantees under edge sampling, and establishes a structural dichotomy between pseudometric and general weighted instances.
Why It Matters
Understanding the sparsifiability of correlation clustering is crucial for improving the efficiency of unsupervised learning algorithms. The findings could enhance the scalability of clustering methods, making them more applicable in real-world scenarios where data is often incomplete.
Key Takeaways
- Correlation clustering's strongest LP-based approximation guarantees require $\Theta(n^3)$ triangle inequality constraints, which is prohibitive at scale.
- The study characterizes a trade-off between the amount of observed edge information and the achievable approximation quality in clustering.
- A robust $\frac{10}{3}$-approximation can be achieved once sufficiently many edges are observed, highlighting the importance of pseudometric structure.
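To make the $\Theta(n^3)$ constraint count concrete, here is a minimal sketch (not from the paper; names are illustrative) that enumerates the triangle inequality constraints $x_{ij} \le x_{ik} + x_{kj}$ of the standard correlation-clustering LP relaxation:

```python
from itertools import combinations

def triangle_constraints(n):
    """Enumerate the triangle-inequality constraints x_ij <= x_ik + x_kj
    of the standard correlation-clustering LP relaxation.

    Each unordered triple {i, j, k} yields three constraints (one per
    choice of the "long" edge), so there are 3 * C(n, 3) in total.
    """
    cons = []
    for i, j, k in combinations(range(n), 3):
        cons.append(((i, j), (i, k), (k, j)))
        cons.append(((i, k), (i, j), (j, k)))
        cons.append(((j, k), (j, i), (i, k)))
    return cons

for n in (10, 20, 40):
    # grows as Theta(n^3): 3 * n*(n-1)*(n-2)/6
    print(n, len(triangle_constraints(n)))
```

Even at modest $n$ the constraint count dominates the $\binom{n}{2}$ variables, which is the bottleneck the paper's sparsification results target.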
Computer Science > Machine Learning
arXiv:2602.13684 (cs)
[Submitted on 14 Feb 2026]
Title: On the Sparsifiability of Correlation Clustering: Approximation Guarantees under Edge Sampling
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Abstract: Correlation Clustering (CC) is a fundamental unsupervised learning primitive whose strongest LP-based approximation guarantees require $\Theta(n^3)$ triangle inequality constraints and are prohibitive at scale. We initiate the study of \emph{sparsification--approximation trade-offs} for CC, asking how much edge information is needed to retain LP-based guarantees. We establish a structural dichotomy between pseudometric and general weighted instances. On the positive side, we prove that the VC dimension of the clustering disagreement class is exactly $n{-}1$, yielding additive $\varepsilon$-coresets of optimal size $\tilde{O}(n/\varepsilon^2)$; that at most $\binom{n}{2}$ triangle inequalities are active at any LP vertex, enabling an exact cutting-plane solver; and that a sparsified variant of LP-PIVOT, which imputes missing LP marginals via triangle inequalities, achieves a robust $\frac{10}{3}$-approximation (up to an additive term controlled by an empirically computable imputation-quality statistic $\overline{\Gamma}_w$) once $\til...
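Since at most $\binom{n}{2}$ triangle inequalities are active at any LP vertex, a cutting-plane solver can start from a small constraint set and add violated triangles lazily. A minimal separation-oracle sketch (function names, data layout, and tolerance are illustrative assumptions, not the paper's implementation):

```python
from itertools import combinations

def violated_triangles(x, n, tol=1e-9):
    """Separation oracle sketch for a cutting-plane CC solver.

    Given candidate LP marginals x[(i, j)] in [0, 1] for i < j
    (interpreted as distances), return the triangle inequalities
    x_ab <= x_ac + x_cb that are violated by more than tol, as
    ((a, b), c, violation_amount) tuples.
    """
    def d(a, b):
        return x[(min(a, b), max(a, b))]

    viol = []
    for i, j, k in combinations(range(n), 3):
        # check each of the three possible "long" edges in the triple
        for a, b, c in ((i, j, k), (i, k, j), (j, k, i)):
            slack = d(a, b) - d(a, c) - d(c, b)
            if slack > tol:
                viol.append(((a, b), c, slack))
    return viol

# toy check: marginals on {0, 1, 2} that violate one triangle inequality
x = {(0, 1): 1.0, (0, 2): 0.2, (1, 2): 0.2}
print(violated_triangles(x, 3))
```

In a cutting-plane loop one would re-solve the LP with only the returned constraints added, repeating until the oracle finds no violations; the active-constraint bound suggests the working set stays near $\binom{n}{2}$ rather than $\Theta(n^3)$.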