[2511.12449] MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding
Computer Science > Computer Vision and Pattern Recognition

arXiv:2511.12449 (cs)

[Submitted on 16 Nov 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

Authors: Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

Abstract: Recent Multimodal Large Language Models (MLLMs) have significantly advanced e-commerce product understanding. However, they still face three challenges: (i) modality imbalance induced by modality-mixed training; (ii) underutilization of the intrinsic alignment between the visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced MultimOdal representation learning framework for e-commerce prOduct uNderstanding. It comprises: (1) a Modality-driven Mixture-of-Experts (MoE) that adaptively processes input samples according to their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage the semantic alignment within individual products; and (3) an MLLM-based Image-text Co-augment...
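The abstract does not include implementation details, so the following is only a minimal sketch of what a modality-driven MoE layer might look like: experts whose gating is conditioned on each sample's modality composition (image-only, text-only, or image+text). The class name ModalityDrivenMoE, the two-bit modality indicator, the soft-gating scheme, and all dimensions are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDrivenMoE(nn.Module):
    """Hypothetical sketch of an MoE layer gated by modality composition.
    Not the paper's implementation; names and sizes are assumptions."""

    def __init__(self, dim: int = 256, num_experts: int = 4):
        super().__init__()
        # Each expert is a small feed-forward block over the fused feature.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # The gate sees the fused feature plus a 2-bit indicator
        # (has_image, has_text), so routing can adapt per sample to
        # its modality composition.
        self.gate = nn.Linear(dim + 2, num_experts)

    def forward(self, x, has_image, has_text):
        # x: (batch, dim) fused product representation
        # has_image / has_text: (batch,) float indicators in {0., 1.}
        modality = torch.stack([has_image, has_text], dim=-1)
        weights = F.softmax(self.gate(torch.cat([x, modality], dim=-1)), dim=-1)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        # Soft combination of expert outputs, weighted by the gate.
        return torch.einsum("be,bed->bd", weights, expert_out)

# Toy usage: one image+text sample and one text-only sample.
moe = ModalityDrivenMoE()
x = torch.randn(2, 256)
out = moe(x,
          has_image=torch.tensor([1.0, 0.0]),
          has_text=torch.tensor([1.0, 1.0]))
print(out.shape)  # torch.Size([2, 256])
```

Conditioning the gate on explicit modality indicators (rather than on the feature alone) is one plausible way to let text-only, image-only, and multimodal samples take different expert paths, which is the kind of adaptive processing the abstract attributes to its Modality-driven MoE.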