[2603.26164] DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
Computer Science > Machine Learning
arXiv:2603.26164 (cs)
[Submitted on 27 Mar 2026]

Title: DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Authors: Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang

Abstract: Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data over the course of training. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides ext...
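The three paradigms described in the abstract share a common shape: between optimizer steps, a policy inspects training signals and adjusts how the next batch is drawn or weighted. As a purely illustrative sketch (this is not DataFlex's API; the function and names below are hypothetical), sample reweighting can be reduced to weighting per-sample losses before reduction:

```python
# Hypothetical sketch of dynamic sample reweighting, not DataFlex's actual interface.
# A common heuristic: upweight samples the model currently finds hard (high loss).

def reweighted_loss(per_sample_losses):
    """Combine per-sample losses into a scalar, weighting each sample
    proportionally to its current loss (weights normalized to sum to 1)."""
    total = sum(per_sample_losses)
    weights = [loss / total for loss in per_sample_losses]
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))

# Example: losses [1.0, 3.0] yield weights [0.25, 0.75],
# so the weighted loss (2.5) exceeds the plain mean (2.0),
# pushing the optimizer toward the harder sample.
loss = reweighted_loss([1.0, 3.0])
```

In a real trainer this scalar would replace the default mean reduction of the batch loss; domain mixture adjustment and sample selection can be framed analogously as policies over sampler probabilities and per-sample keep/drop masks, respectively.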