[2510.00037] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

arXiv - AI

Summary

This paper evaluates the robustness of Vision-Language-Action (VLA) models against various multi-modal perturbations, proposing a new method, RobustVLA, that enhances performance across multiple modalities.

Why It Matters

As VLA models are increasingly deployed in real-world applications, understanding their robustness to diverse perturbations is crucial. This research addresses gaps in existing methodologies, providing insights that could improve the reliability of AI systems in dynamic environments.

Key Takeaways

  • Robustness in VLA models is critical for real-world applications.
  • Actions are identified as the most vulnerable modality in VLA systems.
  • The proposed RobustVLA method significantly outperforms existing models in robustness and inference speed.

Computer Science > Computer Vision and Pattern Recognition

arXiv:2510.00037 (cs) [Submitted on 26 Sep 2025 (v1), last revised 24 Feb 2026 (this version, v4)]

Title: On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

Authors: Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Weifeng Lv, Simin Li

Abstract: In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) pi0 demonstrates superior robustness. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes the mismatch in the flow matching objective. This can be seen as adversarial training, label smoothing, and...
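The abstract's idea of optimizing against worst-case action noise under a flow matching objective can be illustrated with a toy sketch. This is not the paper's implementation: the linear velocity model, the random-search inner maximization, and the epsilon-ball noise budget below are all simplifying assumptions; RobustVLA operates on a full VLA policy with a learned flow matching head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear velocity model v_theta(x_t, t) = W @ [x_t; t]
# (hypothetical stand-in for a VLA's flow matching action head).
dim = 4
W = rng.normal(scale=0.1, size=(dim, dim + 1))

def velocity(W, x_t, t):
    feat = np.concatenate([x_t, [t]])
    return W @ feat

def fm_loss(W, x0, x1, t):
    """Flow matching loss along the straight-line path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return float(np.sum((velocity(W, x_t, t) - target) ** 2))

def worst_case_noise(W, x0, x1, t, eps=0.1, n_candidates=64):
    """Random search for action noise (within an eps-ball around the
    clean action x1) that maximizes the flow matching mismatch."""
    best_delta, best_loss = np.zeros_like(x1), fm_loss(W, x0, x1, t)
    for _ in range(n_candidates):
        d = rng.normal(size=x1.shape)
        d = eps * d / np.linalg.norm(d)       # project onto the eps-sphere
        loss = fm_loss(W, x0, x1 + d, t)
        if loss > best_loss:
            best_delta, best_loss = d, loss
    return best_delta, best_loss

x0 = rng.normal(size=dim)   # source noise sample
x1 = rng.normal(size=dim)   # clean expert action
t = 0.5

delta, adv_loss = worst_case_noise(W, x0, x1, t)
print(f"clean loss {fm_loss(W, x0, x1, t):.3f} -> worst-case loss {adv_loss:.3f}")
```

Training against `x1 + delta` instead of `x1` is what makes the procedure resemble adversarial training: the model is fit to targets that maximize its current mismatch rather than to clean labels alone.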

