[2602.16111] Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing
Summary
The paper presents a scalable framework for measuring content prevalence in large-scale A/B tests. A surrogate signal is calibrated offline against reference labels, which decouples expensive labeling from per-experiment evaluation.
Why It Matters
Online platforms use A/B experiments to evaluate trade-offs, which requires measuring how often users are exposed to specific content attributes. Labeling content for every experiment arm and segment is too costly and slow to serve as a default measurement. This framework calibrates a surrogate once offline and then estimates prevalence from impression logs alone.
Key Takeaways
- Introduces a surrogate-based prevalence measurement framework for A/B testing.
- Decouples expensive content labeling from the evaluation process.
- Uses only impression logs to estimate prevalence for arbitrary experiment arms and segments once the surrogate is calibrated.
- Validates the accuracy of surrogate estimates against reference estimates.
- Facilitates scalable and low-latency prevalence measurement in experimentation.
Statistics > Applications
arXiv:2602.16111 (stat)
[Submitted on 18 Feb 2026]
Title: Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing
Authors: Zehao Xu, Tony Paek, Kevin O'Sullivan, Attila Dobi
Abstract: Online media platforms often need to measure how frequently users are exposed to specific content attributes in order to evaluate trade-offs in A/B experiments. A direct approach is to sample content, label it using a high-quality rubric (e.g., an expert-reviewed LLM prompt), and estimate impression-weighted prevalence. However, repeatedly running such labeling for every experiment arm and segment is too costly and slow to serve as a default measurement at scale. We present a scalable surrogate-based prevalence measurement framework that decouples expensive labeling from per-experiment evaluation. The framework calibrates a surrogate signal to reference labels offline and then uses only impression logs to estimate prevalence for arbitrary experiment arms and segments. We instantiate this framework using score bucketing as the surrogate: we discretize a model score into buckets, estimate bucket-level prevalences from an offline labeled sample, and combine these calibrated bucket level prevalences with the bucket distribution of impressions in each arm to obtain fast, log-based estimates. A...
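The score-bucketing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bucket edges, the function names (`bucket_of`, `calibrate`, `estimate_prevalence`), the toy data, and the use of a simple unweighted mean within each bucket are all assumptions made for the example. The estimate for an arm is the sum over buckets of (calibrated bucket prevalence) x (share of that arm's impressions falling in the bucket).

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket edges over a model score in [0, 1] (not from the paper).
BUCKET_EDGES = [0.25, 0.5, 0.75]


def bucket_of(score):
    """Map a model score to a bucket index in 0..len(BUCKET_EDGES)."""
    return bisect_right(BUCKET_EDGES, score)


def calibrate(labeled_sample):
    """Offline step: per-bucket prevalence from (score, label) pairs.

    label is 1 if the reference rubric flags the content, else 0.
    Assumes a simple unweighted mean within each bucket.
    """
    totals, positives = Counter(), Counter()
    for score, label in labeled_sample:
        b = bucket_of(score)
        totals[b] += 1
        positives[b] += label
    return {b: positives[b] / totals[b] for b in totals}


def estimate_prevalence(impression_scores, bucket_prevalence):
    """Per-experiment step: needs only logged scores, one per impression.

    Raises KeyError if an arm has impressions in a bucket that had no
    labeled examples during calibration.
    """
    counts = Counter(bucket_of(s) for s in impression_scores)
    n = sum(counts.values())
    return sum(bucket_prevalence[b] * c / n for b, c in counts.items())


# Toy data, invented for illustration only.
labeled = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1),
           (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1)]
bucket_prevalence = calibrate(labeled)

control_scores = [0.1, 0.1, 0.3, 0.9]
treatment_scores = [0.1, 0.6, 0.8, 0.9]
control_estimate = estimate_prevalence(control_scores, bucket_prevalence)
treatment_estimate = estimate_prevalence(treatment_scores, bucket_prevalence)
print(control_estimate, treatment_estimate)
```

Calibration runs once on the labeled sample, while `estimate_prevalence` can be rerun for any arm or segment using logged scores alone, which is the decoupling the abstract describes.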