[2509.25541] Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Computer Science > Computer Vision and Pattern Recognition
arXiv:2509.25541 (cs)
[Submitted on 29 Sep 2025 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Authors: Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao

Abstract: Although reinforcement learning (RL) has emerged as a promising approach for improving vision-language models (VLMs) and multimodal large language models (MLLMs), current methods rely heavily on manually curated datasets and costly human verification, which limits scalable self-improvement in multimodal systems. To address this challenge, we propose Vision-Zero, a label-free, domain-agnostic multi-agent self-play framework for self-evolving VLMs through competitive visual games generated from arbitrary image inputs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games ...
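The self-play loop the abstract describes can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: `describe` and `vote` stand in for VLM calls on the shared versus subtly altered image, the roles and majority-vote rule are assumptions, and the key point is only that game outcomes yield reward-labeled trajectories with no human annotation.

```python
import random

def play_round(num_players, describe, vote, rng):
    """One 'Who Is the Spy'-style self-play round (hypothetical sketch).

    describe(player, is_spy) -> clue string; vote(player, clues) -> accused index.
    Both are stand-ins for VLM calls, not the paper's actual interface.
    """
    spy = rng.randrange(num_players)              # one player sees the altered image
    clues = [describe(i, i == spy) for i in range(num_players)]
    votes = [vote(i, clues) for i in range(num_players)]
    accused = max(set(votes), key=votes.count)    # plurality vote decides the accusation
    civilians_win = accused == spy
    # The game outcome labels every trajectory with a reward, label-free.
    return [
        {"player": i,
         "role": "spy" if i == spy else "civilian",
         "clue": clues[i],
         "reward": float(civilians_win == (i != spy))}
        for i in range(num_players)
    ]

# Toy usage with deterministic stubs in place of a real VLM:
rng = random.Random(0)
trajectories = play_round(
    num_players=4,
    describe=lambda i, is_spy: f"clue-{'spy' if is_spy else 'civ'}-{i}",
    vote=lambda i, clues: next(j for j, c in enumerate(clues) if "spy" in c),
    rng=rng,
)
```

In a real training loop, the returned trajectories would feed an RL update (e.g., a policy-gradient step) on the VLM that played both roles.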