[2407.14971] Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models


Computer Science > Computer Vision and Pattern Recognition

arXiv:2407.14971 (cs) [Submitted on 20 Jul 2024 (v1), last revised 7 Apr 2026 (this version, v3)]

Title: Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Authors: Md Zarif Hossain, Ahmed Imteaj

Abstract: Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while preserving overall semantic representations. Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This design avoids large-batch contrastive learning and additional momentum encoders, enabling robust training with low computational overhead. We evaluate Sim-CLIP across multiple Vision-...
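The symmetric cosine-similarity objective with stop-gradient described in the abstract can be sketched as follows. This is a minimal illustration in PyTorch, not the paper's implementation: the function names (`neg_cosine`, `sim_clip_loss`) and the use of randomly generated features in place of CLIP encoder outputs are assumptions for demonstration.

```python
import torch
import torch.nn.functional as F


def neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Negative cosine similarity; z is detached, implementing the
    # stop-gradient that SimSiam-style objectives use to avoid collapse.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()


def sim_clip_loss(feat_clean: torch.Tensor, feat_adv: torch.Tensor) -> torch.Tensor:
    # Symmetric form: each view is aligned to a stop-gradient copy of the
    # other, so gradients flow only through the non-detached branch.
    return 0.5 * (neg_cosine(feat_clean, feat_adv)
                  + neg_cosine(feat_adv, feat_clean))


# Toy usage: random features stand in for the vision encoder's outputs
# on a clean image and its adversarially perturbed view.
feat_clean = torch.randn(4, 512, requires_grad=True)
feat_adv = feat_clean + 0.01 * torch.randn(4, 512)
loss = sim_clip_loss(feat_clean, feat_adv)
loss.backward()  # gradients reach feat_clean through both symmetric terms
```

Because the target in each term is detached, no negative pairs or momentum encoder are needed, which matches the abstract's claim of low computational overhead relative to large-batch contrastive training.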

Originally published on April 08, 2026. Curated by AI News.

