[2603.01696] Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.01696 (cs) [Submitted on 2 Mar 2026]

Title: Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Authors: Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang

Abstract: Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss forces LVLMs to attend to image details and produce precise descriptions. However, measuring information loss during modality conversion is inherently challenging because of the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity among images retrieved via text search using that caption. Based on this insight, we propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, the LVLM minimizes informa…
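The abstract names two retrieval-based reward terms but gives no formulas. A minimal sketch of how such terms could be computed from image embeddings, assuming the retrieved gallery images and the query image are already encoded by some vision encoder (the function name `cim_reward` and the equal-weight combination are illustrative assumptions, not the paper's definitions):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cim_reward(query_emb, gallery_embs, alpha=0.5):
    """Hypothetical reward combining the two perspectives named in the abstract.

    query_emb:    embedding of the original (query) image, shape (d,)
    gallery_embs: embeddings of the images retrieved by text search
                  using the generated caption, shape (k, d)
    alpha:        assumed mixing weight between the two terms
    """
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)

    # Gallery Representation Consistency: how tightly the retrieved
    # images cluster, as mean pairwise cosine similarity (self-pairs excluded).
    sims = g @ g.T
    k = g.shape[0]
    grc = (sims.sum() - k) / (k * (k - 1))

    # Query-gallery Image Relevance: mean cosine similarity between
    # the query image and the retrieved gallery images.
    qir = float((g @ q).mean())

    return alpha * grc + (1 - alpha) * qir
```

A caption that retrieves images nearly identical to the query yields a reward near 1, while a caption retrieving a scattered, unrelated gallery scores lower, matching the paper's premise that retrieval similarity tracks caption quality.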