[2508.11810] FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples
Summary
FairTabGen is an LLM-based framework that generates high-fidelity synthetic healthcare tabular data from a small subset of the original dataset, improving fairness while maintaining competitive predictive utility.
Why It Matters
This research addresses two obstacles in healthcare data generation: privacy and regulatory constraints limit access to clinical data, and existing synthetic data generators require specialized knowledge of generative models and substantial compute. Higher-quality, fairer synthetic data could support clinical research and AI applications in healthcare.
Key Takeaways
- FairTabGen generates high-quality synthetic healthcare data using only a small subset of original data.
- On the MIMIC-IV dataset, the framework reports a 50% improvement in fairness through unawareness while maintaining competitive predictive utility.
- The distribution of racial groups is skewed, which affects demographic parity; bias mitigation algorithms are applied to improve it.
- The method uses 99% less data than traditional approaches.
- FairTabGen addresses privacy and regulatory challenges in healthcare data usage.
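Demographic parity, mentioned in the takeaways above, is commonly measured as the gap in positive-prediction rates between groups. The sketch below shows one simple way to compute that gap. It is an illustration only: the function name, the toy predictions, and the group labels "A" and "B" are assumptions made here, not the paper's evaluation code or data.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return (gap, rates): the largest difference in positive-prediction
    rate between any two groups, plus the per-group rates.

    Hypothetical helper for illustration; not taken from the paper.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Toy, made-up data: binary predictions and a protected attribute.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap, rates = demographic_parity_difference(preds, groups)
print(rates)  # per-group positive-prediction rates
print(gap)    # 0 would mean perfect demographic parity
```

A gap near zero means every group receives positive predictions at a similar rate. When group sizes are heavily skewed, the rate for a small group rests on few samples and becomes unstable.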
Computer Science > Machine Learning
arXiv:2508.11810 (cs)
[Submitted on 15 Aug 2025 (v1), last revised 18 Feb 2026 (this version, v2)]
Title: FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples
Authors: Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani
Abstract: Synthetic healthcare data generation offers a promising solution to research limitations in clinical settings caused by privacy and regulatory constraints. However, current synthetic data generation approaches require specialized knowledge about training generative models and high computational resources. In this paper, we propose FairTabGen, an LLM-based tabular data generation framework that produces high-quality synthetic healthcare data using only a small subset of the original dataset. Our method combines in-context learning, prompt curation, and embedded structural constraints for data synthesis. We evaluate performance on the MIMIC-IV dataset. Our method uses 99% less data and achieves a 50% improvement in fairness through unawareness while maintaining competitive predictive utility. However, we observe that the distribution of racial groups is skewed, which affects demographic parity. We thereafter apply bias mitigation algorithms in t...
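The abstract names in-context learning, prompt curation, and structural constraints but does not spell out the prompt itself. As a rough sketch of how those three ideas can fit together, the code below builds a few-shot prompt from a column schema and a handful of example rows, then filters a model reply against the same schema. Everything specific here is an assumption for illustration: the column names (`age`, `sex`, `los_days`), the schema format, the helper names, and the hard-coded `fake_reply` standing in for an LLM response. It is not the authors' implementation.

```python
import csv
import io

# Hypothetical schema: column -> (kind, rule). Not from the paper.
SCHEMA = {
    "age": ("int", (18, 90)),
    "sex": ("cat", {"F", "M"}),
    "los_days": ("float", (0.0, 60.0)),
}


def describe(schema):
    """Render the structural constraints as plain-text bullet lines."""
    parts = []
    for name, (kind, rule) in schema.items():
        if kind == "cat":
            parts.append(f"- {name}: one of {sorted(rule)}")
        else:
            lo, hi = rule
            parts.append(f"- {name}: {kind} in [{lo}, {hi}]")
    return "\n".join(parts)


def build_prompt(schema, examples, n_rows):
    """Assemble a few-shot prompt: header, constraints, example rows."""
    header = ",".join(schema)
    rows = "\n".join(",".join(str(ex[c]) for c in schema) for ex in examples)
    return (
        f"Generate {n_rows} new rows of CSV with header: {header}\n"
        f"Column constraints:\n{describe(schema)}\n"
        f"Example rows:\n{rows}\n"
        "Return only CSV rows, no header."
    )


def parse_and_validate(schema, text):
    """Keep only reply rows that satisfy every column constraint."""
    names = list(schema)
    valid = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != len(names):
            continue  # wrong column count (including blank lines)
        row = {}
        try:
            for name, raw in zip(names, fields):
                kind, rule = schema[name]
                raw = raw.strip()
                if kind == "cat":
                    if raw not in rule:
                        raise ValueError(name)
                    row[name] = raw
                else:
                    value = int(raw) if kind == "int" else float(raw)
                    lo, hi = rule
                    if not lo <= value <= hi:
                        raise ValueError(name)
                    row[name] = value
        except ValueError:
            continue  # unparseable or out-of-range value
        valid.append(row)
    return valid


# Made-up seed rows standing in for the small real subset.
examples = [
    {"age": 64, "sex": "F", "los_days": 3.5},
    {"age": 41, "sex": "M", "los_days": 1.0},
]
prompt = build_prompt(SCHEMA, examples, n_rows=4)

# Stand-in for an LLM reply; a real pipeline would call a model here.
fake_reply = "52,F,2.0\n130,M,4.0\n33,X,1.5\n47,M,7.25\n"
rows = parse_and_validate(SCHEMA, fake_reply)
print(prompt)
print(rows)
```

Validating the reply against the same schema used in the prompt means malformed or out-of-range rows are dropped before any utility or fairness evaluation.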