[2602.16590] A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Summary
This paper presents CLIP-MHAdapter, a lightweight variant of CLIP adaptation for street-view image attribute classification. It appends a bottleneck MLP with multi-head self-attention over patch tokens and, with approximately 1.4 million trainable parameters, reports superior or competitive accuracy at low computational cost.
Why It Matters
Street-view image attribute classification supports applications such as autonomous driving, urban analytics, and high-definition map construction. Existing CLIP adaptation and fine-tuning methods often rely on global image embeddings, which limits how well they capture fine-grained, localised attributes in cluttered street scenes. CLIP-MHAdapter targets that gap with a lightweight adapter that models dependencies between image patches.
Key Takeaways
- CLIP-MHAdapter improves street-view image classification accuracy.
- The model uses multi-head self-attention to capture inter-patch dependencies.
- Achieves superior or competitive accuracy with approximately 1.4 million trainable parameters.
- Addresses limitations of existing adaptation methods reliant on global embeddings.
- Targets applications such as autonomous driving, urban analytics, and high-definition map construction.
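To make the parameter-budget takeaway concrete, the sketch below counts trainable parameters for a bottleneck adapter containing one multi-head self-attention layer. The layer layout (down-projection, attention, up-projection) and the sizes `embed_dim` and `bottleneck_dim` are illustrative assumptions, not the configuration reported in the paper, so the total is not expected to match the paper's figure.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters in a dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def mhsa_params(dim: int) -> int:
    """Parameters in a standard multi-head self-attention layer.

    The query, key, value, and output projections are each dim x dim
    with a bias. The number of heads does not change the total,
    because the heads split `dim` between them.
    """
    return 4 * linear_params(dim, dim)


def adapter_params(embed_dim: int, bottleneck_dim: int) -> int:
    """Hypothetical adapter: down-project, self-attend, up-project."""
    down = linear_params(embed_dim, bottleneck_dim)
    attn = mhsa_params(bottleneck_dim)
    up = linear_params(bottleneck_dim, embed_dim)
    return down + attn + up


# Assumed sizes for illustration only (not taken from the paper).
total = adapter_params(embed_dim=768, bottleneck_dim=256)
print(f"{total:,} trainable parameters")
```

Because the backbone stays frozen in this kind of setup, only the adapter's count matters for training cost, and it is controlled almost entirely by the bottleneck width.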
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.16590 (cs)
[Submitted on 18 Feb 2026]
Authors: Qi You, Yitai Cheng, Zichao Zeng, James Haworth
Abstract: Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight ...
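The adapter described in the abstract can be pictured with a small forward pass. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the ReLU, the residual connections, the mean pooling, the random weights, and every dimension are hypothetical, and random vectors stand in for the patch tokens a frozen CLIP image encoder would produce.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over patch tokens. x: (num_patches, dim)."""
    n, dim = x.shape
    head_dim = dim // num_heads

    def split_heads(t):
        # (n, dim) -> (heads, n, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Every patch attends to every other patch: (heads, n, n).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    out = softmax(scores, axis=-1) @ v          # (heads, n, head_dim)
    out = out.transpose(1, 0, 2).reshape(n, dim)  # merge heads
    return out @ w_o


def adapter_forward(patch_tokens, params, num_heads=4):
    """Hypothetical bottleneck adapter with attention in the bottleneck."""
    # Down-project each patch token into the bottleneck (assumed ReLU).
    h = np.maximum(patch_tokens @ params["down"], 0.0)
    # Model inter-patch dependencies (assumed residual connection).
    h = h + multi_head_self_attention(
        h, params["q"], params["k"], params["v"], params["o"], num_heads
    )
    # Up-project and add back to the frozen features (assumed residual).
    adapted = patch_tokens + h @ params["up"]
    # Pool patches into one image-level feature (assumed mean pooling).
    return adapted.mean(axis=0)


# Toy sizes and random weights, chosen only to keep the example small.
rng = np.random.default_rng(0)
embed_dim, bottleneck_dim, num_patches = 64, 16, 49
params = {
    "down": rng.normal(0.0, 0.02, (embed_dim, bottleneck_dim)),
    "q": rng.normal(0.0, 0.02, (bottleneck_dim, bottleneck_dim)),
    "k": rng.normal(0.0, 0.02, (bottleneck_dim, bottleneck_dim)),
    "v": rng.normal(0.0, 0.02, (bottleneck_dim, bottleneck_dim)),
    "o": rng.normal(0.0, 0.02, (bottleneck_dim, bottleneck_dim)),
    "up": rng.normal(0.0, 0.02, (bottleneck_dim, embed_dim)),
}
patch_tokens = rng.normal(size=(num_patches, embed_dim))  # stand-in for CLIP patches
image_feature = adapter_forward(patch_tokens, params)
print(image_feature.shape)
```

The point of the sketch is the contrast with adapters that see only a single global embedding: here the attention step mixes information across patches before pooling, so a small localised region can influence the final feature.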