[2602.22586] TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Summary
The paper presents TabDLM, a framework for generating free-form tabular data via joint numerical-language diffusion, addressing the difficulty existing methods face in jointly modeling structured numerical attributes and open-ended text fields.
Why It Matters
As synthetic tabular data generation becomes crucial for applications such as data augmentation, foundation models, and privacy-preserving data sharing, TabDLM offers a significant advance by combining numerical and textual generation in one model, improving the quality and utility of the generated datasets.
Key Takeaways
- TabDLM integrates numerical and language data generation in a unified model.
- The framework utilizes masked diffusion for text and continuous diffusion for numerical features.
- Extensive experiments show TabDLM outperforms existing methods in generating high-quality tabular data.
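The second takeaway pairs two different forward (noising) processes: Gaussian diffusion over numerical features and masked (absorbing-state) diffusion over text tokens. The sketch below illustrates, in a minimal and hypothetical way, what one forward noising step on a single heterogeneous row could look like; the function names, the `MASK` id, and the toy data are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_numeric(x0, alpha_bar_t, rng):
    """Continuous (Gaussian) forward diffusion on numeric features:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

MASK = -1  # hypothetical id for the [MASK] token

def mask_tokens(tokens, alpha_bar_t, rng):
    """Masked (absorbing-state) forward diffusion on text tokens:
    each token is independently replaced by MASK with prob 1 - alpha_bar_t."""
    keep = rng.random(tokens.shape) < alpha_bar_t
    return np.where(keep, tokens, MASK)

# Toy "row": two numeric columns plus a short tokenized text field.
x0 = np.array([3.2, -0.7])
tok = np.array([101, 523, 77, 9004])

alpha_bar_t = 0.25  # heavy noise, i.e. late in the forward process
xt = noise_numeric(x0, alpha_bar_t, rng)
tok_t = mask_tokens(tok, alpha_bar_t, rng)
print(xt, tok_t)
```

A reverse model trained under this scheme would then jointly denoise both parts: predicting the clean numeric values from `xt` while unmasking the `MASK` positions in `tok_t`, which is what lets the two modalities condition on each other.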
Computer Science > Machine Learning
arXiv:2602.22586 (cs)
[Submitted on 26 Feb 2026]
Title: TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Authors: Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen, Muhan Zhang
Abstract: Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical-...