[2602.16061] Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models
Summary
This paper develops a framework for partial identification of population quantities under missing-not-at-random data: sharp bounds on the estimand are computed by solving a pair of linear programs, and outcome predictions from pretrained models act as weak shadow variables that enter as additional constraints to tighten those bounds.
Why It Matters
Estimating outcomes when data are missing not at random is a core challenge in platform evaluation and the social sciences: respondents are rarely representative of the full population, so standard estimators are biased. This study shows that readily available predictions from pretrained models can stand in for the bespoke auxiliary variables that classical identification strategies require, making credible bounds attainable in settings where point identification is out of reach.
Key Takeaways
- Introduces a framework for partial identification using weak shadow variables.
- Demonstrates how predictions from pretrained models enter as linear constraints that tighten the identification bounds.
- Provides a set-expansion estimator that widens the estimated set to guarantee valid coverage of the identified set.
- Shows significant reduction in identification intervals using LLM predictions.
- Addresses challenges of missing not at random (MNAR) data in practical scenarios.
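The set-expansion idea in the takeaways can be sketched generically. This is not the paper's specific estimator, just the standard interval-widening heuristic for inference on a partially identified parameter, assuming asymptotically normal estimates of the two endpoints:

```python
import math

def expand_interval(lo_hat, hi_hat, se_lo, se_hi, z=1.96):
    """Widen the estimated bounds outward by z standard errors of each
    endpoint, so the expanded set covers the identified set with roughly
    95% confidence. A generic sketch, not the paper's estimator."""
    return lo_hat - z * se_lo, hi_hat + z * se_hi

# Hypothetical estimated bounds [0.50, 0.57] with endpoint SEs of 0.01.
lo95, hi95 = expand_interval(0.50, 0.57, 0.01, 0.01)
```

The expansion is one-sided at each endpoint: the lower bound only moves down and the upper bound only moves up, so sampling noise can never shrink the reported set below the estimated identified set.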
Statistics > Machine Learning
arXiv:2602.16061 (stat)
[Submitted on 17 Feb 2026]

Title: Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models
Authors: Hongyu Chen, David Simchi-Levi, Ruoxuan Xiong

Abstract: Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical ...
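The abstract's linear-programming formulation can be illustrated on a toy problem (the discretization and all numbers below are illustrative assumptions, not from the paper). With a discrete outcome Y and a binary model prediction Z assumed independent of the response indicator R given Y, the conditional independence turns into linear matching constraints on the unknown nonresponder distribution q0[y] = P(Y=y, R=0), and minimizing/maximizing E[Y] over the resulting feasible set yields the bounds:

```python
import itertools
import numpy as np

def lp_bounds(c, A_eq, b_eq):
    """Min and max of c @ q over {q >= 0 : A_eq @ q = b_eq}, found by
    brute-force enumeration of basic feasible solutions. Fine for toy
    sizes; a real implementation would call an LP solver."""
    m, n = A_eq.shape
    vals = []
    for cols in itertools.combinations(range(n), m):
        B = A_eq[:, cols]
        if abs(np.linalg.det(B)) < 1e-12:
            continue  # singular basis, skip
        q_basis = np.linalg.solve(B, b_eq)
        if (q_basis < -1e-9).any():
            continue  # infeasible (negative mass)
        q = np.zeros(n)
        q[list(cols)] = q_basis
        vals.append(float(c @ q))
    return min(vals), max(vals)

# Toy MNAR setup (all numbers hypothetical). The shadow variable Z is a
# binary prediction from a pretrained model, observed for everyone.
y_vals = np.array([0.0, 0.5, 1.0])
p1 = np.array([0.20, 0.20, 0.20])        # P(Y=y, R=1), identified from responders
pz1_given_y = np.array([0.2, 0.6, 0.8])  # P(Z=1 | Y=y), identified from responders
m0 = np.array([0.16, 0.24])              # P(Z=z, R=0) for z=0,1, observed

# Z independent of R given Y makes the matching constraints
#   sum_y P(Z=z | Y=y) * q0[y] = P(Z=z, R=0)
# linear in the unknown q0, so the sharp bounds are a pair of LPs.
A_eq = np.vstack([1.0 - pz1_given_y, pz1_given_y])
lo_q, hi_q = lp_bounds(y_vals, A_eq, m0)

responder_part = float(y_vals @ p1)  # contribution of observed outcomes
lo, hi = responder_part + lo_q, responder_part + hi_q
print(f"identified interval for E[Y]: [{lo:.4f}, {hi:.4f}]")
```

Without the shadow-variable constraints the feasible set would only be pinned down by total nonresponder mass, giving the much wider worst-case (Manski-style) bounds; the Z constraints shrink the feasible polytope and hence the interval.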