[2602.17395] SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
Summary
The paper presents SpectralGCD, an approach to Generalized Category Discovery (GCD) that integrates image and text information through a unified cross-modal representation, achieving competitive accuracy while reducing computational cost.
Why It Matters
As the demand for automated category discovery in unlabeled datasets grows, SpectralGCD offers a significant advancement by addressing the limitations of existing methods. Its efficient cross-modal representation learning can lead to broader applications in machine learning and artificial intelligence, particularly in environments with limited labeled data.
Key Takeaways
- SpectralGCD improves Generalized Category Discovery by using cross-modal representations.
- The method reduces reliance on spurious visual cues through semantic concept mixtures.
- It achieves competitive accuracy with lower computational costs compared to state-of-the-art methods.
- The approach utilizes knowledge distillation to enhance the quality of learned representations.
- Code for SpectralGCD is publicly available, promoting further research and application.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.17395 (cs)
[Submitted on 19 Feb 2026]
Title: SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery
Authors: Lorenzo Caselli, Marco Mistretta, Simone Magistri, Andrew D. Bagdanov
Abstract: Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed s...
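The core representational idea in the abstract can be sketched in a few lines: each image becomes a softmax mixture over its similarities to a dictionary of concept embeddings, and a spectral step operates on the covariance of those mixtures. The sketch below is illustrative only; `concept_mixture`, `spectral_filter`, the temperature value, and the random stand-in embeddings are assumptions, not the paper's actual implementation.

```python
import numpy as np

def concept_mixture(image_feats, concept_feats, temperature=0.07):
    """Express each image as a softmax mixture over image-concept similarities.
    Features are L2-normalized so dot products act as cosine similarities
    (CLIP-style); the temperature is an illustrative choice."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = concept_feats / np.linalg.norm(concept_feats, axis=1, keepdims=True)
    sims = img @ txt.T / temperature            # (n_images, n_concepts)
    sims -= sims.max(axis=1, keepdims=True)     # numerical stability
    exp = np.exp(sims)
    return exp / exp.sum(axis=1, keepdims=True)

def spectral_filter(mixtures, k):
    """Project mixtures onto the top-k eigendirections of their covariance.
    This is only a guess at the spirit of 'Spectral Filtering'; the paper's
    exact construction over the cross-modal covariance may differ."""
    centered = mixtures - mixtures.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(mixtures) - 1)
    _, vecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    top = vecs[:, -k:]                          # top-k eigenvectors
    return mixtures @ top                       # filtered representation

# Random stand-ins for CLIP image embeddings and a concept dictionary.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(32, 512))
concepts = rng.normal(size=(100, 512))
mix = concept_mixture(imgs, concepts)           # (32, 100) concept mixtures
proj = spectral_filter(mix, k=8)                # (32, 8) filtered features
```

In a real pipeline the random arrays would be replaced by CLIP image embeddings and text embeddings of the task-agnostic concept dictionary.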