[2603.17655] Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.17655 (cs)
[Submitted on 18 Mar 2026 (v1), last revised 22 Mar 2026 (this version, v2)]

Title: Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
Authors: Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li

Abstract: Cross-Domain Few-Shot Learning (CDFSL) adapts models trained on large-scale general data (the source domain) to downstream target domains with only scarce training data; research on vision-language models (e.g., CLIP) in this setting is still in its early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, even though they can roughly attend to important regions in source domains. Although prior work has demonstrated CLIP's shortcomings in capturing subtle local patterns, in this paper we find that the domain gap and scarce training data further exacerbate these shortcomings, far more than for holistic patterns; we call this the local misalignment problem in CLIP-based CDFSL. To address it, given the lack of supervision for aligning local visual features with text semantics, we turn to self-supervised information. Inspired b...
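The local misalignment the abstract describes can be probed by comparing patch-level visual features against class text embeddings, rather than only the global image embedding. The sketch below is purely illustrative and not from the paper: it assumes hypothetical arrays standing in for CLIP ViT patch tokens and class text embeddings, and computes a per-patch cosine-similarity alignment map whose row-wise argmax shows which class (if any) each local region aligns with.

```python
import numpy as np

def local_alignment_map(patch_feats, text_feats):
    """Cosine similarity between every local patch feature and every class
    text embedding.

    patch_feats: (N, D) array of local visual features
                 (e.g., the 49 patch tokens of a 7x7 ViT grid)
    text_feats:  (C, D) array of class text embeddings
    returns:     (N, C) alignment map; inspecting its row-wise argmax
                 reveals whether fine-grained regions align with the
                 correct class semantics.
    """
    # L2-normalize so the dot product equals cosine similarity
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return p @ t.T

# Toy stand-ins for CLIP features (hypothetical dimensions: 49 patches, 512-d, 5 classes)
rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 512))
texts = rng.standard_normal((5, 512))

sim = local_alignment_map(patches, texts)
print(sim.shape)          # (49, 5)
print(sim.min() >= -1.0 and sim.max() <= 1.0)  # cosine values are bounded
```

A holistic (global) similarity is the special case where `patch_feats` is replaced by the single pooled image embedding; the gap between the two views is what the paper's diagnosis of "local misalignment" concerns.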