[2603.04803] Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.04803 (cs)

[Submitted on 5 Mar 2026]

Title: Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation

Authors: Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang

Abstract: The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. ...
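The abstract's "straightforward design" jointly optimizes a reconstruction objective and a contrastive objective on the same encoder, and attributes its failure to gradient conflict between the two losses. A minimal sketch of that setup and of one common conflict diagnostic (cosine similarity between the two losses' gradients on the shared parameters, negative when they conflict) is below. All names here are hypothetical stand-ins: a tiny linear layer plays the role of CLIP's visual encoder, another plays the diffusion-based reconstruction branch, and a supervised InfoNCE loss stands in for the paper's contrastive signal; this is not the authors' actual architecture or training recipe.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny "visual encoder" (role of CLIP's image
# encoder) and a "decoder" (role of the diffusion reconstruction branch).
encoder = torch.nn.Linear(16, 8)
decoder = torch.nn.Linear(8, 16)

def recon_loss(x):
    """Reconstruction surrogate for the diffusion branch (targets P-Ability)."""
    return F.mse_loss(decoder(encoder(x)), x)

def contrastive_loss(x, labels, temperature=0.1):
    """Supervised-InfoNCE surrogate for the contrastive branch (D-Ability)."""
    z = F.normalize(encoder(x), dim=-1)
    logits = z @ z.t() / temperature
    # Mask out self-similarity with a large negative value (finite, so that
    # masked entries contribute exactly zero after multiplying by `pos`).
    eye = torch.eye(len(x), dtype=torch.bool)
    logits = logits.masked_fill(eye, -1e9)
    pos = (labels[:, None] == labels[None, :]).float().masked_fill(eye, 0.0)
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    # Average negative log-likelihood over each sample's positives.
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

def grad_cosine(loss_a, loss_b, params):
    """Cosine similarity between the two losses' gradients w.r.t. the shared
    encoder parameters; a negative value indicates gradient conflict."""
    ga = torch.autograd.grad(loss_a, params, retain_graph=True)
    gb = torch.autograd.grad(loss_b, params, retain_graph=True)
    va = torch.cat([g.flatten() for g in ga])
    vb = torch.cat([g.flatten() for g in gb])
    return F.cosine_similarity(va, vb, dim=0).item()

x = torch.randn(8, 16)                      # toy batch of "images"
labels = torch.randint(0, 2, (8,))          # toy class labels
params = list(encoder.parameters())         # parameters shared by both losses

l_rec = recon_loss(x)
l_con = contrastive_loss(x, labels)
cos = grad_cosine(l_rec, l_con, params)
naive_total = l_rec + l_con                 # the "naive combination"
print(f"recon={l_rec.item():.3f}  contrastive={l_con.item():.3f}  grad-cos={cos:+.3f}")
```

With this diagnostic in hand, a negative `grad-cos` on a real batch is the kind of evidence the abstract alludes to when it says the naive sum of the two objectives "suffers from gradient conflict"; the truncated abstract presumably goes on to describe how the authors resolve it.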