[2602.17949] CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Summary
CUICurate is a graph-based retrieval-augmented generation (GraphRAG) framework that automates the curation of UMLS concept sets for clinical NLP pipelines.
Why It Matters
Many downstream clinical NLP tasks operate on concept sets (a target concept plus its related synonyms, subtypes, and supertypes) rather than on single CUIs. Building these sets by hand is labor-intensive, inconsistently performed, and poorly supported by existing tools. Automating the step is intended to make clinical NLP pipelines more scalable and reproducible.
Key Takeaways
- CUICurate automates the curation of clinical concept sets, reducing manual effort.
- The framework retrieves candidate CUIs from an embedded UMLS knowledge graph, then filters and classifies them with a large language model.
- It produces larger and more complete concept sets than the manually curated benchmark.
- GPT-5-mini showed higher recall, while GPT-5 aligned better with clinician judgments.
- Outputs are stable and computationally efficient, making it suitable for various clinical NLP applications.
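The recall comparison in the takeaways can be made concrete with a small worked example. This is a minimal sketch that assumes set-based precision and recall against a reference concept set; the exact metrics used in the paper are not stated in this excerpt, and the function name and identifiers below are hypothetical placeholders rather than real UMLS CUIs.

```python
def precision_recall(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of a predicted concept set
    against a reference (e.g. manually curated) concept set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Placeholder identifiers for illustration only, not real UMLS CUIs.
reference = {"C0000001", "C0000002", "C0000003", "C0000004"}
predicted = {"C0000001", "C0000002", "C0000003", "C0000009", "C0000010"}

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.60 recall=0.75
```

Here three of the five predicted identifiers appear in the reference set (precision 3/5), and three of the four reference identifiers were recovered (recall 3/4). A model that returns larger sets can raise recall while lowering precision, which is one way two models can rank differently depending on the measure.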
Computer Science > Computation and Language
arXiv:2602.17949 (cs)
[Submitted on 20 Feb 2026]
Authors: Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego
Abstract
Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is labour-intensive, inconsistently performed, and poorly supported by existing tools, particularly for NLP pipelines that operate directly on UMLS CUIs.
Methods: We present CUICurate, a Graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. For each target concept, candidate CUIs were retrieved from the KG, followed by large language model (LLM) filtering and classification steps comparing two LLMs (GPT-5 and GPT-5-mini). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standar...
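
The retrieve-then-filter flow described in the Methods can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the in-memory `KG` dictionary, the cosine-similarity retrieval, the function names, and the rule-based `stub_llm_keep` (standing in for an LLM call) are all hypothetical, and the identifiers, names, and vectors are placeholders rather than real UMLS data. The classification step is omitted.

```python
import math
from typing import Callable

# Toy stand-in for an embedded knowledge graph: id -> (name, embedding).
# All identifiers, names, and vectors are illustrative placeholders.
KG: dict[str, tuple[str, list[float]]] = {
    "C0000001": ("heart failure", [0.9, 0.1, 0.0]),
    "C0000002": ("congestive heart failure", [0.85, 0.15, 0.05]),
    "C0000003": ("heart failure screening", [0.6, 0.1, 0.6]),
    "C0000004": ("renal failure", [0.1, 0.9, 0.0]),
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_candidates(target: str, k: int) -> list[str]:
    """Return the k concepts most similar to the target, excluding the target."""
    target_vec = KG[target][1]
    others = [cui for cui in KG if cui != target]
    others.sort(key=lambda cui: cosine(target_vec, KG[cui][1]), reverse=True)
    return others[:k]


def stub_llm_keep(target_name: str, candidate_name: str) -> bool:
    """Rule-based placeholder for an LLM relevance judgment (assumption):
    keep candidates whose name ends with the target name."""
    return candidate_name.endswith(target_name)


def curate(target: str, k: int, keep: Callable[[str, str], bool]) -> list[str]:
    """Retrieve candidates, then keep only those the filter accepts."""
    target_name = KG[target][0]
    return [cui for cui in retrieve_candidates(target, k) if keep(target_name, KG[cui][0])]


print(curate("C0000001", k=2, keep=stub_llm_keep))
```

Passing the filter in as a callable keeps retrieval and judgment separate, so the same candidates can be scored by two different models and the results compared, which mirrors the two-LLM comparison the abstract describes.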