[2509.25084] Scaling Generalist Data-Analytic Agents
Computer Science > Computation and Language
arXiv:2509.25084 (cs)
[Submitted on 29 Sep 2025 (v1), last revised 27 Feb 2026 (this version, v2)]

Title: Scaling Generalist Data-Analytic Agents

Authors: Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Zhao Bin, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

Abstract: Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle with the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategies, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable ...