[2602.14914] Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation
Summary
This paper presents a theoretical analysis demonstrating that additive control variates outperform self-normalisation techniques in off-policy evaluation, particularly in ranking and recommendation systems.
Why It Matters
The findings challenge conventional methods in off-policy evaluation, suggesting a shift towards additive control variates for improved performance. This has significant implications for machine learning practitioners who need to evaluate ranking and recommendation systems without resorting to costly online experiments.
Key Takeaways
- Additive control variates provide superior performance in off-policy evaluation compared to self-normalised methods.
- The paper proves that the β*-IPS estimator asymptotically dominates SNIPS in Mean Squared Error.
- An analytical decomposition of the variance gap shows that SNIPS is asymptotically equivalent to a specific, but generally sub-optimal, additive baseline, motivating the move to optimal baseline corrections.
- The results are crucial for enhancing the efficiency of ranking and recommendation systems.
- Theoretical guarantees for additive methods may lead to broader adoption in practical applications.
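To make the comparison concrete, here is a minimal numerical sketch of the three estimators discussed: vanilla IPS, self-normalised IPS (SNIPS), and an additive-baseline estimator with the standard variance-optimal baseline β* = Cov(w, wr)/Var(w). The synthetic data (log-normal importance weights, a reward model loosely correlated with the weights) is an illustrative assumption and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic bandit feedback: importance weights w = pi(a|x) / pi0(a|x),
# drawn log-normal so that E[w] ~ exp(0.5); rewards loosely correlated with w.
# Both choices are illustrative assumptions, not the paper's setup.
w = rng.lognormal(mean=0.0, sigma=1.0, size=n)
p = np.clip(0.1 + 0.05 * np.log1p(w), 0.0, 1.0)
r = rng.binomial(1, p)

mean_w = np.mean(w)          # sample estimate of E[w] (equals 1 in expectation)

# Vanilla IPS: unbiased, but high variance under heavy-tailed weights.
ips = np.mean(w * r)

# SNIPS: multiplicative control variate (self-normalisation).
snips = np.sum(w * r) / np.sum(w)

# beta*-IPS: additive control variate with the variance-optimal baseline
# beta* = Cov(w, w r) / Var(w), estimated from the same sample.
beta_star = np.cov(w, w * r)[0, 1] / np.var(w)
beta_ips = np.mean(w * r - beta_star * (w - 1.0))

print(f"IPS:      {ips:.4f}")
print(f"SNIPS:    {snips:.4f}")
print(f"beta*-IPS: {beta_ips:.4f}")
```

Note that SNIPS and β*-IPS are both biased in finite samples but asymptotically unbiased; the paper's result concerns their asymptotic Mean Squared Error, where the optimal additive baseline dominates.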
Computer Science > Machine Learning
arXiv:2602.14914 (cs) [Submitted on 16 Feb 2026]
Title: Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation
Authors: Olivier Jeunen, Shashank Gupta
Abstract: Off-policy evaluation (OPE) is essential for assessing ranking and recommendation systems without costly online interventions. Self-Normalised Inverse Propensity Scoring (SNIPS) is a standard tool for variance reduction in OPE, leveraging a multiplicative control variate. Recent advances in off-policy learning suggest that additive control variates (baseline corrections) may offer superior performance, yet theoretical guarantees for evaluation are lacking. This paper provides a definitive answer: we prove that $\beta^\star$-IPS, an estimator with an optimal additive baseline, asymptotically dominates SNIPS in Mean Squared Error. By analytically decomposing the variance gap, we show that SNIPS is asymptotically equivalent to using a specific -- but generally sub-optimal -- additive baseline. Our results theoretically justify shifting from self-normalisation to optimal baseline corrections for both ranking and recommendation.
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Cite as: arXiv:2602.14914 [cs.LG] (or arXiv:2602.14914v1 [cs.LG] for this version)
https://do...