[2602.20089] StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Summary
The paper presents StructXLIP, a novel approach that enhances vision-language models by integrating multimodal structural cues, improving cross-modal retrieval performance.
Why It Matters
StructXLIP addresses a critical gap in vision-language alignment by leveraging structural cues, which can significantly enhance the accuracy of models in interpreting and retrieving information across modalities. This advancement is particularly relevant for applications requiring detailed understanding of visual content, such as image captioning and retrieval systems.
Key Takeaways
- StructXLIP improves vision-language model performance by integrating structural cues.
- The method enhances cross-modal retrieval through a structure-centric fine-tuning approach.
- It introduces three new structure-centric losses to optimize alignment between visual and textual data.
- The approach outperforms existing models in both general and specialized domains.
- Code and pretrained models are publicly available for further research and application.
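The structure-centric losses described above are contrastive alignment objectives. As a rough illustration, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss of the kind such alignment terms typically build on; the function name `info_nce`, the temperature value, and the pairing of edge-map embeddings with structural-caption embeddings are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Row i of `a` and row i of `b` are treated as a positive pair
    (e.g. an edge-map embedding and its structural-caption embedding);
    all other rows in the batch act as negatives.
    """
    # L2-normalize, then compute the temperature-scaled similarity matrix
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    labels = np.arange(len(a))

    def xent(l):
        # cross-entropy of the softmax over each row against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the a->b and b->a directions, as in CLIP-style training
    return 0.5 * (xent(logits) + xent(logits.T))

# Hypothetical embeddings: captions nearly aligned with their edge maps
rng = np.random.default_rng(0)
edge_emb = rng.normal(size=(4, 8))
text_emb = edge_emb + 0.01 * rng.normal(size=(4, 8))
loss = info_nce(edge_emb, text_emb)
```

In a setup like StructXLIP's, one would sum several such terms (edge-to-structural-text, local-region-to-chunk, edge-to-color-image) alongside the standard alignment loss, with weights left to the paper.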
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.20089 (cs) [Submitted on 23 Feb 2026]
Title: StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Authors: Zanxi Ruan, Qiuyu Kong, Songqun Gao, Yiming Wang, Marco Cristani
Abstract: Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes t...
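The abstract's edge maps (e.g., Canny) are ordinary gradient-based edge images. As a self-contained illustration of the idea, here is a NumPy sketch using a Sobel gradient-magnitude filter as a simpler stand-in for Canny; the function `sobel_edges` and the toy image are hypothetical and not taken from the paper's pipeline.

```python
import numpy as np

def sobel_edges(img):
    """Gradient-magnitude edge map via 3x3 Sobel filters.

    A simplified stand-in for Canny: no smoothing, thresholding,
    or non-maximum suppression, just |grad| per pixel.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    out = np.zeros((h, w))
    pad = np.pad(img.astype(float), 1, mode="edge")
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 3, j:j + 3]
            gx = (patch * kx).sum()  # horizontal gradient
            gy = (patch * ky).sum()  # vertical gradient
            out[i, j] = np.hypot(gx, gy)
    return out

# A vertical step edge: the filter responds along the boundary column
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_edges(img)
```

In a StructXLIP-style setup, an edge map like `edges` would be encoded by the image tower as a proxy for the photo's visual structure.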