[2407.14971] Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2407.14971 (cs)

[Submitted on 20 Jul 2024 (v1), last revised 7 Apr 2026 (this version, v3)]

Title: Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Authors: Md Zarif Hossain, Ahmed Imteaj

Abstract: Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade downstream performance and the semantic quality of multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while preserving its overall semantic representations. Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This design avoids large-batch contrastive learning and additional momentum encoders, enabling robust training with low computational overhead. We evaluate Sim-CLIP across multiple Vision-...
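The abstract's core mechanism is a cosine-similarity objective between clean and adversarial views with a symmetric stop-gradient, used in place of large-batch contrastive losses or a momentum encoder. A minimal PyTorch sketch of one such loss computation follows; the PGD attack, its budget, and the exact loss symmetrization are assumed details for illustration, not the paper's specified procedure.

```python
import torch
import torch.nn.functional as F


def pgd_attack(encoder, images, eps=4 / 255, alpha=1 / 255, steps=10):
    """Craft an adversarial view by pushing its embedding away from the
    clean embedding (untargeted PGD in embedding space; assumed setup,
    with images in [0, 1] and an L-inf budget of eps)."""
    with torch.no_grad():
        z_clean = encoder(images)
    delta = torch.empty_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        z_adv = encoder((images + delta).clamp(0, 1))
        # The attacker maximizes this loss, i.e. minimizes cosine similarity.
        attack_loss = -F.cosine_similarity(z_adv, z_clean, dim=-1).mean()
        (grad,) = torch.autograd.grad(attack_loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()  # ascend on the attack loss
            delta.clamp_(-eps, eps)       # stay within the L-inf budget
    return (images + delta).detach().clamp(0, 1)


def sim_clip_loss(encoder, images):
    """Symmetric negative-cosine loss between clean and adversarial views.
    `.detach()` plays the role of the stop-gradient on the target branch."""
    adv_images = pgd_attack(encoder, images)
    z_clean = encoder(images)
    z_adv = encoder(adv_images)
    return -0.5 * (
        F.cosine_similarity(z_adv, z_clean.detach(), dim=-1).mean()
        + F.cosine_similarity(z_clean, z_adv.detach(), dim=-1).mean()
    )
```

In this sketch, `.detach()` blocks gradients into whichever branch serves as the target, which is what lets a two-branch Siamese setup train stably without the large negative batches or additional momentum encoder that contrastive methods typically require, matching the efficiency claim in the abstract.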