[2602.16590] A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

arXiv - Machine Learning February 19, 2026 3 min read Article

Summary

This paper presents CLIP-MHAdapter, a lightweight adaptation of the pre-trained vision-language model CLIP for street-view image attribute classification. It appends a bottleneck MLP with multi-head self-attention over patch tokens, and reports superior or competitive accuracy with approximately 1.4 million trainable parameters.

Why It Matters

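Street-view image attribute classification enables applications such as autonomous driving, urban analytics, and high-definition map construction. Existing CLIP adaptation and fine-tuning methods often rely on global image embeddings, which limits their ability to capture the fine-grained, localised attributes of complex, cluttered street scenes. The paper addresses this with a lightweight adapter that models dependencies between patch tokens rather than working from a single global embedding.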

Key Takeaways

CLIP-MHAdapter adapts pre-trained CLIP to street-view image attribute classification.
The adapter is a bottleneck MLP with multi-head self-attention over patch tokens, which models inter-patch dependencies.
It reports superior or competitive accuracy with approximately 1.4 million trainable parameters.
It targets a limitation of existing adaptation methods: reliance on global image embeddings, which miss fine-grained, localised attributes.
Intended applications include autonomous driving, urban analytics, and high-definition map construction.

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.16590 (cs)
[Submitted on 18 Feb 2026]

Title: A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

Authors: Qi You, Yitai Cheng, Zichao Zeng, James Haworth

Abstract: Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight ...
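To make the idea in the abstract concrete, here is a minimal, dependency-free sketch of a bottleneck adapter with multi-head self-attention over patch tokens. It is an illustration only, not the authors' implementation: the function names, the toy dimensions, the ReLU activation, the omission of biases, the residual blend weight `alpha`, and the final mean pooling are all assumptions made for the example. Random matrices stand in for frozen CLIP patch tokens and for learned adapter weights.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random stand-in for learned weights or for frozen patch tokens."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention_head(q, k, v):
    """Scaled dot-product attention for one head; returns (output, weights)."""
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # one row of patch-to-patch scores per patch
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """Split projected tokens into heads, attend per head, concatenate, project."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_head = len(q[0]) // num_heads
    heads = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        out, _ = self_attention_head(
            [r[lo:hi] for r in q], [r[lo:hi] for r in k], [r[lo:hi] for r in v]
        )
        heads.append(out)
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, wo)


def init_params(dim, bottleneck, seed=0):
    """Hypothetical adapter weights: down/up projections plus attention matrices."""
    rng = random.Random(seed)
    params = {"down": rand_matrix(dim, bottleneck, rng),
              "up": rand_matrix(bottleneck, dim, rng)}
    for name in ("wq", "wk", "wv", "wo"):
        params[name] = rand_matrix(bottleneck, bottleneck, rng)
    return params


def adapter_forward(patches, params, num_heads=2, alpha=0.5):
    """Down-project, attend across patches, up-project, blend with the input."""
    down = [[max(0.0, v) for v in row] for row in matmul(patches, params["down"])]
    attended = multi_head_self_attention(
        down, params["wq"], params["wk"], params["wv"], params["wo"], num_heads
    )
    up = matmul(attended, params["up"])
    # Assumed residual blend: alpha weights the adapted tokens, the rest is the input.
    return [[alpha * u + (1 - alpha) * p for u, p in zip(u_row, p_row)]
            for u_row, p_row in zip(up, patches)]


def mean_pool(tokens):
    """Average patch tokens into a single image-level feature vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


# Toy example: 6 patch tokens of width 8, bottleneck width 4, 2 heads.
patches = rand_matrix(6, 8, random.Random(1), scale=1.0)
params = init_params(dim=8, bottleneck=4)
adapted = adapter_forward(patches, params)
image_feature = mean_pool(adapted)
```

In this sketch only the entries of `params` would be trained, while the patch tokens come from a frozen backbone, which is what keeps the trainable parameter count small. Because attention runs at the bottleneck width rather than the full token width, the attention matrices stay small as well.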
