[2505.18150] Generative Distribution Embeddings: Lifting autoencoders to the space of distributions for multiscale representation learning
Summary
The paper introduces Generative Distribution Embeddings (GDE), a framework that lifts autoencoders to operate on entire distributions rather than single data points, enabling multiscale representation learning.
Why It Matters
GDEs address the need for models that can reason across multiple scales in real-world problems, particularly in computational biology. By improving representation learning, GDEs can lead to better predictive models and insights in complex biological datasets, making them highly relevant for researchers in machine learning and bioinformatics.
Key Takeaways
- Generative Distribution Embeddings (GDE) extend traditional autoencoders by encoding sets of samples and generating matching distributions, rather than reconstructing individual data points.
- GDEs demonstrate superior performance on synthetic datasets compared to existing methods.
- The framework is applicable to various computational biology challenges, including RNA sequencing and synthetic promoter design.
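The core architectural idea, an encoder that acts on a set of samples and depends only on the underlying distribution, can be illustrated with a mean-pooled feature map. This is a minimal sketch, not the paper's implementation: the feature map `phi` and the function names are hypothetical, and a fixed random projection stands in for a learned network. Mean pooling makes the embedding permutation-invariant, a necessary condition for the distributional invariance the paper requires of GDE encoders.

```python
import numpy as np

def phi(x, W):
    # Hypothetical per-sample feature map: fixed random projection + tanh.
    # In a real GDE this would be a learned network.
    return np.tanh(x @ W)

def gde_encode(samples, W):
    """Encode a SET of samples by mean-pooling per-sample features.

    The result depends only on the empirical distribution of the set,
    not on sample order, so the embedding is permutation-invariant.
    """
    return phi(samples, W).mean(axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8))        # hypothetical feature weights
X = rng.normal(size=(500, 2))      # one "distribution" represented as a sample set

z1 = gde_encode(X, W)
z2 = gde_encode(rng.permutation(X), W)  # same set, shuffled
assert np.allclose(z1, z2)              # permutation invariance holds
```

In the full framework this embedding conditions a generative model that is trained to reproduce the input distribution, replacing the pointwise decoder of a standard autoencoder.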
Computer Science > Machine Learning
arXiv:2505.18150 (cs)
[Submitted on 23 May 2025 (v1), last revised 20 Feb 2026 (this version, v2)]
Title: Generative Distribution Embeddings: Lifting autoencoders to the space of distributions for multiscale representation learning
Authors: Nic Fishman, Gokul Gowri, Peng Yin, Jonathan Gootenberg, Omar Abudayyeh
Abstract: Many real-world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the $W_2$ distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions. We systematically benchmark ...
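The abstract's claim that latent distances recover $W_2$ and latent interpolation recovers optimal transport trajectories has a clean reference point in the Gaussian case, where both are available in closed form. The sketch below (a standard textbook identity, not code from the paper; function names are hypothetical) shows the 1-D case: $W_2^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$, and the displacement (McCann) interpolant between two Gaussians is again Gaussian with linearly interpolated mean and standard deviation.

```python
import numpy as np

def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2):
    W2^2 = (m1 - m2)^2 + (s1 - s2)^2."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def ot_interpolate_1d(m1, s1, m2, s2, t):
    """Displacement (McCann) interpolation between 1-D Gaussians:
    the OT geodesic stays Gaussian, with linearly interpolated
    mean and standard deviation."""
    return (1 - t) * m1 + t * m2, (1 - t) * s1 + t * s2

# Midpoint of the OT trajectory between N(0, 1) and N(4, 9):
m, s = ot_interpolate_1d(0.0, 1.0, 4.0, 3.0, 0.5)
print(m, s)                                 # 2.0 2.0
print(w2_gaussian_1d(0.0, 1.0, 4.0, 3.0))   # sqrt(16 + 4) = sqrt(20)
```

These closed forms are what make Gaussian and Gaussian-mixture families a natural benchmark for checking whether a learned latent geometry matches Wasserstein geometry.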