[2602.15368] GMAIL: Generative Modality Alignment for generated Image Learning
Summary
The paper presents GMAIL, a framework that aligns generated images with real images in a shared latent space, improving performance across a range of vision-language tasks.
Why It Matters
As generative models become increasingly prevalent, understanding how to effectively integrate generated images into training datasets is crucial. GMAIL addresses the challenges posed by modality discrepancies, potentially improving the robustness and accuracy of machine learning models across multiple applications.
Key Takeaways
- GMAIL treats generated images as a distinct modality from real images.
- The framework employs a multi-modal learning approach to align these modalities effectively.
- The framework yields significant improvements on tasks such as image captioning and zero-shot image retrieval.
- GMAIL can be integrated with various vision-language models, enhancing their performance.
- The approach shows positive trends in scaling generated data for improved model training.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.15368 (cs)
[Submitted on 17 Feb 2026]
Title: GMAIL: Generative Modality Alignment for generated Image Learning
Authors: Shentong Mo, Sukmin Yun
Abstract: Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative model...
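The abstract mentions a "cross-modality alignment loss" that pulls embeddings of generated images toward embeddings of paired real images in a shared latent space, but gives no formula. A minimal sketch of what such a loss might look like, assuming an InfoNCE-style symmetric contrastive objective over paired embeddings (the function name, pairing convention, and temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def cross_modality_alignment_loss(z_real, z_gen, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning generated-image embeddings
    with real-image embeddings in a shared latent space.

    z_real, z_gen: (N, D) arrays of paired embeddings; row i of each
    is assumed to describe the same underlying content.
    """
    # L2-normalize so the dot product is cosine similarity.
    z_real = z_real / np.linalg.norm(z_real, axis=1, keepdims=True)
    z_gen = z_gen / np.linalg.norm(z_gen, axis=1, keepdims=True)

    logits = z_real @ z_gen.T / temperature  # (N, N); positives on the diagonal
    labels = np.arange(len(z_real))

    def cross_entropy(lg):
        # Log-softmax with max subtraction for numerical stability.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Average the real->generated and generated->real directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Under this reading, fine-tuning on generated images with such a loss drives the generated-image modality toward the real-image embedding distribution, after which the aligned encoder can be used to train downstream vision-language models on generated data.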