[2508.11810] FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples

arXiv - Machine Learning

Summary

FairTabGen introduces a novel framework for generating high-fidelity synthetic healthcare data from limited samples, enhancing fairness and predictive utility.

Why It Matters

This research addresses the difficulty of generating healthcare data under privacy and regulatory constraints. By improving the quality and fairness of synthetic data while using far fewer original records, it could make clinical research and healthcare AI applications more practical where real patient data is hard to share.

Key Takeaways

  • FairTabGen generates high-quality synthetic healthcare data using only a small subset of original data.
  • The framework reports a 50% improvement in fairness through unawareness while maintaining competitive predictive utility.
  • Bias mitigation algorithms enhance demographic parity in generated data.
  • The method uses 99% less of the original data than existing approaches.
  • FairTabGen addresses privacy and regulatory challenges in healthcare data usage.
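
Demographic parity, mentioned in the takeaways above, compares how often each demographic group receives a positive prediction. The following is a minimal sketch of one common way to measure it (the largest gap in positive-prediction rate between groups). It is not taken from the paper: the function name `demographic_parity_difference`, the toy predictions, and the group labels "A" and "B" are all assumptions made for illustration.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return (gap, rates): the largest difference in positive-prediction
    rate between any two groups, plus the per-group rates.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    # Positive-prediction rate per group.
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Toy data (assumed): binary predictions and a group label per record.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap, rates = demographic_parity_difference(preds, groups)
print(rates)  # per-group positive-prediction rates
print(gap)    # 0 would mean perfect demographic parity on this measure
```

A gap near zero means the groups receive positive predictions at similar rates. A skewed group distribution in the data, as the abstract notes for racial groups, can widen this gap.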

Computer Science > Machine Learning

arXiv:2508.11810 (cs)
[Submitted on 15 Aug 2025 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples

Authors: Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani

Abstract: Synthetic healthcare data generation offers a promising solution to research limitations in clinical settings caused by privacy and regulatory constraints. However, current synthetic data generation approaches require specialized knowledge about training generative models and high computational resources. In this paper, we propose FairTabGen, an LLM-based tabular data generation framework that produces high-quality synthetic healthcare data using only a small subset of the original dataset. Our method combines in-context learning, prompt curation, and embedding structural constraints for data synthesis. We evaluate performance on the MIMIC-IV dataset. Our method uses 99% less data and achieves a 50% improvement for fairness through unawareness while maintaining competitive predictive utility. However, we observe that the data distribution of racial groups is skewed, affecting demographic parity. We thereafter apply bias mitigation algorithms in t...
