[2410.12476] Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation
Computer Science > Computation and Language
arXiv:2410.12476 (cs)
[Submitted on 16 Oct 2024 (v1), last revised 25 Mar 2026 (this version, v3)]
Title: Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation
Authors: Zerui Xu, Fang Wu, Yingzhou Lu, Yuanyuan Zhang, Yue Zhao
Abstract: Machine learning (ML) holds great promise for clinical applications but is often hindered by limited access to high-quality data due to privacy concerns, high costs, and long timelines associated with clinical trials. While large language models (LLMs) have demonstrated strong performance in general-purpose generation tasks, their application to synthesizing realistic clinical trials remains underexplored. In this work, we propose a novel Retrieval-Reasoning framework that leverages few-shot prompting with LLMs to generate synthetic clinical trial reports annotated with binary success/failure outcomes. Our approach integrates a retrieval module to ground the generation on relevant trial data and a reasoning module to ensure domain-consistent justifications. Experiments conducted on real clinical trials from the ClinicalTrials.gov database demonstrate that the generated synthetic trials effectively augment real datasets. Fine-tuning a BioBERT classifier on synthetic data, real data, or their combination shows that hybrid fine...
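The abstract describes retrieving relevant trials and then few-shot prompting an LLM to produce a justified, outcome-labeled synthetic report. The following is only a minimal sketch of that retrieve-then-prompt idea, not the authors' implementation. The `Trial` record, the toy corpus, the Jaccard token-overlap retriever, and the prompt wording are all assumptions made for illustration; the paper's actual retriever, schema, and prompts are not given in this text. No LLM is called here; the sketch stops at the assembled prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Hypothetical trial record; the real schema is not specified in the abstract."""
    summary: str
    outcome: str  # "success" or "failure"


def tokenize(text: str) -> set[str]:
    # Lowercased whitespace tokens: a simple stand-in for a learned retriever.
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    # Token-overlap similarity in [0, 1]; 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, corpus: list[Trial], k: int = 2) -> list[Trial]:
    # Rank stored trials by similarity to the query and keep the top k.
    q = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda t: jaccard(q, tokenize(t.summary)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, examples: list[Trial], target_outcome: str) -> str:
    # Few-shot prompt: retrieved trials as examples, then a request for a
    # justification followed by a new report with the requested outcome.
    parts = ["You are given example clinical trial reports with outcomes."]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i} (outcome: {ex.outcome}): {ex.summary}")
    parts.append(f"Topic: {query}")
    parts.append(
        f"First explain why a trial on this topic might end in {target_outcome}, "
        "then write a new synthetic trial report with that outcome."
    )
    return "\n".join(parts)


# Made-up placeholder trials, not real data.
CORPUS = [
    Trial("phase 2 trial of drug A for type 2 diabetes", "success"),
    Trial("phase 3 trial of drug B for breast cancer", "failure"),
    Trial("phase 1 trial of drug C for type 2 diabetes in adults", "failure"),
]

query = "drug for type 2 diabetes"
examples = retrieve(query, CORPUS, k=2)
prompt = build_prompt(query, examples, target_outcome="failure")
print(prompt)
```

In this sketch the retrieved examples ground the prompt in similar trials, and asking for an explanation before the report loosely mirrors the reasoning step. A realistic system would likely swap the token-overlap score for an embedding-based retriever and send the prompt to an LLM.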