[2602.03098] TextME: Bridging Unseen Modalities Through Text Descriptions
Summary
The paper introduces TextME, a framework that enables zero-shot cross-modal transfer using only text descriptions, addressing the limitations of paired datasets in multimodal representation learning.
Why It Matters
TextME advances multimodal representation learning by allowing new modalities to be added without costly paired datasets. This is particularly relevant in fields like medical imaging and molecular analysis, where expert annotations are scarce. Because the framework also supports cross-modal retrieval between modalities that were never explicitly aligned, it broadens the practical reach of multimodal learning across domains.
Key Takeaways
- TextME enables modality expansion using only text descriptions.
- The framework allows zero-shot cross-modal transfer, enhancing flexibility.
- Empirical validation shows substantial performance retention without paired supervision.
- Text-only training can facilitate emergent retrieval between unaligned modalities.
- This approach is a practical alternative to traditional paired dataset methods.
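The core mechanism, as described in the abstract, is fitting a projection into a unified anchor space using only text, then reusing that projection for other modalities that share the contrastive encoder's geometry. The sketch below illustrates the idea with random stand-in embeddings; it is not the authors' architecture. TextME anchors to an LLM embedding space and uses pretrained contrastive encoders, both of which are simulated here with synthetic matrices, and the constant "modality gap" offset is a simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_llm, n = 64, 128, 500

# Stand-in embeddings. In the real setting, T_enc would come from a
# pretrained contrastive text encoder and T_llm from an LLM embedding
# space; here both are synthetic.
T_enc = rng.normal(size=(n, d_enc))                # contrastive-space text embeddings
M = rng.normal(size=(d_enc, d_llm))                # unknown "true" mapping
T_llm = T_enc @ M + 0.01 * rng.normal(size=(n, d_llm))  # LLM-space text embeddings

# Text-only training: fit a linear projection from contrastive space
# to the LLM anchor space via least squares, using text pairs only.
W, *_ = np.linalg.lstsq(T_enc, T_llm, rcond=None)

# Zero-shot transfer: non-text embeddings living in the same contrastive
# space (offset by a roughly consistent modality gap) reuse W unchanged.
gap = 0.05 * rng.normal(size=d_enc)                # stand-in modality gap
I_enc = T_enc + gap                                # pseudo "image" embeddings
I_llm = I_enc @ W                                  # projected into LLM space

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Emergent retrieval: each projected "image" should land nearest its
# corresponding text embedding in the anchor space.
top1 = (cosine(I_llm, T_llm).argmax(axis=1) == np.arange(n)).mean()
print(f"top-1 retrieval accuracy: {top1:.2f}")
```

Because the projection is trained only on text yet applied to the offset "image" embeddings, high retrieval accuracy here illustrates why a consistent modality gap makes text-only training sufficient.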
Computer Science > Machine Learning
arXiv:2602.03098 (cs)
[Submitted on 3 Feb 2026 (v1), last revised 23 Feb 2026 (this version, v2)]
Title: TextME: Bridging Unseen Modalities Through Text Descriptions
Authors: Soyeon Hong, Jinchan Kim, Jaegook You, Seungtaek Choi, Suha Kwak, Hyunsouk Cho
Abstract: Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text-image, text-audio, text-3D, text-molecule), which are costly and often infeasible in domains requiring expert annotation such as medical imaging and molecular analysis. We introduce TextME, the first text-only modality expansion framework, to the best of our knowledge, projecting diverse modalities into LLM embedding space as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that such consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, demonstrating that text-only training can preserve substantial performance of pretrained encoders. We further show that our framework enables emergent cross-modal retrieval between modality pairs not explicitly aligned during training (e.g., audio-to-image, 3D-to...