[2602.15136] Universal priors: solving empirical Bayes via Bayesian inference and pretraining
Summary
The paper gives a theoretical account of why a transformer pretrained on synthetic data can solve empirical Bayes problems: training under suitable universal priors yields a Bayes estimator that adapts to arbitrary test distributions through posterior contraction.
Why It Matters
This research provides theoretical justification for the effectiveness of pretrained models in machine learning, particularly in empirical Bayes scenarios. Understanding how these models adapt to different data distributions is crucial for improving their performance and reliability in practical applications.
Key Takeaways
- Pretrained transformers can solve empirical Bayes problems effectively.
- The existence of universal priors yields a near-optimal regret bound of $\widetilde{O}(\frac{1}{n})$, uniformly over all test distributions.
- Posterior contraction is key to the model's adaptability to unknown test distributions.
- The analysis also explains length generalization, where test sequences are longer than those seen during training.
- This research enhances the understanding of model performance in diverse scenarios.
Statistics > Machine Learning — arXiv:2602.15136 (stat)
[Submitted on 16 Feb 2026]
Title: Universal priors: solving empirical Bayes via Bayesian inference and pretraining
Authors: Nick Cannella, Anzo Teh, Yanjun Han, Yury Polyanskiy
Abstract: We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the model architecture or training dynamics, we ask why a pretrained Bayes estimator, trained under a prespecified training distribution, can adapt to arbitrary test distributions. Focusing on Poisson EB problems, we identify the existence of universal priors such that training under these priors yields a near-optimal regret bound of $\widetilde{O}(\frac{1}{n})$ uniformly over all test distributions. Our analysis leverages the classical phenomenon of posterior contraction in Bayesian statistics, showing that the pretrained transformer adapts to unknown test distributions precisely through posterior contraction. This perspective also explains the phenomenon of length generalization, in which the test sequence length exceeds the training length, as the model performs Bayesian inference using a generali...
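To make the Poisson EB setting from the abstract concrete, here is a minimal sketch of the classical Robbins estimator, the textbook baseline for this problem class: latent rates $\theta_i$ are drawn from an unknown prior $G$, observations are $X_i \sim \mathrm{Poisson}(\theta_i)$, and the Bayes posterior mean $E[\theta \mid X = x] = (x+1) f(x+1)/f(x)$ (with $f$ the marginal pmf) is estimated by plugging in empirical frequencies. The Gamma prior below is a hypothetical choice for illustration only; this is not the paper's pretrained-transformer method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior G (for illustration): theta_i ~ Gamma(2, scale=1.5).
# Observations: X_i ~ Poisson(theta_i). The estimator itself never sees G.
n = 50_000
theta = rng.gamma(shape=2.0, scale=1.5, size=n)
x = rng.poisson(theta)

# Empirical marginal pmf f-hat of X, with room for index x+1.
counts = np.bincount(x, minlength=x.max() + 2)
f = counts / n

def robbins(xi: int) -> float:
    """Robbins' plug-in estimate of E[theta | X = xi] = (xi+1) f(xi+1)/f(xi)."""
    if xi + 1 >= len(f) or f[xi] == 0:
        return float(xi)  # fall back to the observation when xi is unseen
    return (xi + 1) * f[xi + 1] / f[xi]

estimates = np.array([robbins(k) for k in range(10)])
```

For this Gamma prior the exact posterior mean is $0.6\,(x+2)$, so with $n = 50{,}000$ samples `robbins(0)` and `robbins(1)` land close to 1.2 and 1.8; the paper's question is why a single pretrained model can match this kind of per-distribution performance uniformly over all priors $G$.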