[2510.21686] Multimodal Datasets with Controllable Mutual Information
Summary
This paper presents a framework for generating multimodal datasets with controllable mutual information, enhancing the study of mutual information estimators and self-supervised learning techniques.
Why It Matters
The ability to create datasets with known mutual information is crucial for advancing research in machine learning, particularly in self-supervised learning and multimodal data analysis. Because ground-truth MI is rarely available for realistic data, benchmarks built with this framework allow estimators and multimodal methods to be validated directly, with applications ranging from multi-detector astrophysics to self-supervised learning in the highly multimodal regime.
Key Takeaways
- Introduces a framework for generating multimodal datasets with calculable mutual information.
- Demonstrates that regression performance improves with higher mutual information between input modalities and target values.
- Provides a testbed for evaluating mutual information estimators and multimodal self-supervised learning techniques.
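The framework's core idea is that latent variables with an analytically known MI can be pushed through invertible generative maps without changing that MI. A minimal sketch of this in a toy setting follows; a hand-written invertible map stands in for the paper's flow-based model, and all function names are illustrative, not from the paper's code:

```python
import numpy as np

def correlated_latents(n, rho, rng):
    """Draw paired 1-D latents from a bivariate Gaussian with
    correlation rho; their mutual information is known in closed
    form: I(z1; z2) = -0.5 * log(1 - rho**2) nats."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return z[:, 0], z[:, 1]

def modality(z, a=1.3, b=0.5):
    """Stand-in for a normalizing flow: any invertible,
    differentiable map leaves MI unchanged, so the 'observed'
    modalities inherit the latents' MI exactly."""
    return a * z + b + np.tanh(z)  # strictly increasing, hence invertible

rng = np.random.default_rng(0)
rho = 0.8
z1, z2 = correlated_latents(100_000, rho, rng)
x1, x2 = modality(z1), modality(z2)  # two "modalities" with known MI
true_mi = -0.5 * np.log(1.0 - rho**2)
print(f"ground-truth MI between modalities: {true_mi:.3f} nats")
```

Because the map is invertible, the MI between `x1` and `x2` equals the analytic latent MI, which is the property that makes the generated modalities usable as a benchmark.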
arXiv Details
Statistics > Machine Learning, arXiv:2510.21686 (stat)
Submitted on 24 Oct 2025 (v1); last revised 25 Feb 2026 (this version, v2)
Title: Multimodal Datasets with Controllable Mutual Information
Authors: Raheem Karim Hashmani, Garrett W. Merz, Helen Qu, Mariel Pettee, Kyle Cranmer
Abstract: We introduce a framework for generating highly multimodal datasets with explicitly calculable mutual information (MI) between modalities. This enables the construction of benchmark datasets that provide a novel testbed for systematic studies of mutual information estimators and multimodal self-supervised learning (SSL) techniques. Our framework constructs realistic datasets with known MI using a flow-based generative model and a structured causal framework for generating correlated latent variables. We benchmark a suite of MI estimators on datasets with varying ground truth MI values and verify that regression performance improves as the MI increases between input modalities and the target value. Finally, we describe how our framework can be applied to contexts including multi-detector astrophysics and SSL studies in the highly multimodal regime.
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as: arXiv:2510.21686 [stat.ML]
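The benchmarking step described in the abstract, comparing MI estimators against a known ground truth, can be sketched with a crude histogram plug-in estimator. This is an illustrative stand-in for the suite of estimators the paper evaluates, not the authors' implementation:

```python
import numpy as np

def hist_mi(x, y, bins=64):
    """Plug-in MI estimate (in nats) from a 2-D histogram.
    Biased for small samples, but adequate to sanity-check a
    dataset whose ground-truth MI is known analytically."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                      # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y
    nz = pxy > 0                               # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
results = {}
for rho in (0.3, 0.6, 0.9):
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
    truth = -0.5 * np.log(1.0 - rho**2)        # analytic Gaussian MI
    est = hist_mi(z[:, 0], z[:, 1])
    results[rho] = (truth, est)
    print(f"rho={rho}: true={truth:.3f} nats, estimated={est:.3f} nats")
```

Sweeping the correlation mirrors the paper's experiment with varying ground-truth MI values: an estimator can be judged both on its absolute error and on whether its estimates increase with the true MI.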