[2603.29211] Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
Computer Science > Artificial Intelligence
arXiv:2603.29211 (cs)
[Submitted on 31 Mar 2026]

Title: Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
Authors: Zhiqian Zhang, Xu Zhao, Xiaoqing Xu, Guangdong Liang, Weijia Wang, Xiaolei Lv, Bo Li, Jun Gao

Abstract: In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive three-stage pipeline: pre-training, mid-training, and post-training. Ablation studies ...
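The abstract's ~2B-parameter budget can be sanity-checked with simple arithmetic: the InternViT-300M encoder and Qwen3 1.7B LLM are named in the text, while the MLP projector's size is not stated, so the figure used for it below is a hypothetical placeholder for illustration only.

```python
# Back-of-the-envelope check of the ~2B parameter budget described in the
# abstract: InternViT-300M vision encoder + MLP projector + Qwen3 1.7B LLM.
vit_params = 300e6   # InternViT-300M (named in the abstract)
llm_params = 1.7e9   # Qwen3 1.7B (named in the abstract)
mlp_params = 20e6    # projector size: ASSUMED, not stated in the paper

total = vit_params + llm_params + mlp_params
print(f"total ≈ {total / 1e9:.2f}B parameters")  # ≈ 2.02B, i.e. ~2B as claimed
```

Under this assumption the vision encoder accounts for roughly 15% of the parameters, consistent with the abstract's framing that the design trades off fine-grained visual perception against deployment cost.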