[2603.29791] Reasoning-Driven Synthetic Data Generation and Evaluation
Computer Science > Artificial Intelligence
arXiv:2603.29791 (cs)
[Submitted on 31 Mar 2026]
Title: Reasoning-Driven Synthetic Data Generation and Evaluation
Authors: Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous
Abstract: Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution, limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism de...