[2602.19661] PaReGTA: An LLM-based EHR Data Encoding Approach to Capture Temporal Information
Summary
The paper presents PaReGTA, an LLM-based framework for encoding temporal information in electronic health records (EHRs), enhancing patient representation and classification accuracy.
Why It Matters
This research addresses the challenge of capturing temporal data in EHRs, which is crucial for improving patient care and outcomes. By utilizing a lightweight, pre-trained LLM approach, PaReGTA offers a scalable solution that can be applied to various healthcare datasets, making it relevant for researchers and practitioners in the field of health informatics.
Key Takeaways
- PaReGTA encodes longitudinal EHR events into structured templates with temporal cues.
- The framework uses lightweight contrastive fine-tuning for domain-adapted embeddings.
- It aggregates visit embeddings to create a fixed-dimensional patient representation.
- PaReGTA outperforms sparse one-hot and count-based baselines on migraine classification tasks.
- The framework is model-agnostic and can leverage future EHR-specialized sentence-embedding models.
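The first takeaway, converting longitudinal EHR events into templated visit text with explicit temporal cues, can be illustrated with a minimal sketch. The template wording, the event fields (`type`, `desc`), and the gap phrasing below are hypothetical stand-ins, not the paper's actual templates:

```python
from datetime import date

def visit_to_template(visit_date, prior_visit_date, events):
    """Render one visit's structured events as templated text.

    Temporal cues are made explicit: the visit date and the gap
    since the previous visit are written into the sentence, so a
    sentence-embedding model can pick them up from plain text.
    """
    if prior_visit_date is None:
        gap = "first recorded visit"
    else:
        days = (visit_date - prior_visit_date).days
        gap = f"{days} days after the previous visit"
    event_text = "; ".join(f"{e['type']}: {e['desc']}" for e in events)
    return f"Visit on {visit_date.isoformat()} ({gap}). Events: {event_text}."

# Example: a follow-up visit two weeks after the prior one.
text = visit_to_template(
    date(2025, 3, 15),
    date(2025, 3, 1),
    [{"type": "diagnosis", "desc": "migraine without aura"},
     {"type": "medication", "desc": "sumatriptan 50mg"}],
)
```

Each visit's templated string would then be fed to the (contrastively fine-tuned) sentence-embedding model to produce one visit embedding.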
Computer Science > Machine Learning
arXiv:2602.19661 (cs) [Submitted on 23 Feb 2026]
Title: PaReGTA: An LLM-based EHR Data Encoding Approach to Capture Temporal Information
Authors: Kihyuk Yoon, Lingchao Mao, Catherine Chong, Todd J. Schwedt, Chia-Chun Chiang, Jing Li
Abstract: Temporal information in structured electronic health records (EHRs) is often lost in sparse one-hot or count-based representations, while sequence models can be costly and data-hungry. We propose PaReGTA, an LLM-based encoding framework that (i) converts longitudinal EHR events into visit-level templated text with explicit temporal cues, (ii) learns domain-adapted visit embeddings via lightweight contrastive fine-tuning of a sentence-embedding model, and (iii) aggregates visit embeddings into a fixed-dimensional patient representation using hybrid temporal pooling that captures both recency and globally informative visits. Because PaReGTA does not require training from scratch but instead utilizes a pre-trained LLM, it can perform well even in data-limited cohorts. Furthermore, PaReGTA is model-agnostic and can benefit from future EHR-specialized sentence-embedding models. For interpretability, we introduce PaReGTA-RSS (Representation Shift Score), which quantifies clinically defined factor importance by recomputing representati...
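Step (iii) of the abstract, hybrid temporal pooling, can be sketched as follows. The specific weighting choices here are assumptions for illustration: exponential decay as the recency term, and a softmax over embedding norms as a crude proxy for "globally informative" visits (the paper's actual saliency criterion is not reproduced from this abstract):

```python
import numpy as np

def hybrid_temporal_pooling(visit_embeddings, decay=0.9):
    """Pool per-visit embeddings into one fixed-dimensional patient vector.

    visit_embeddings: array of shape (n_visits, dim), ordered
    oldest -> newest. Returns a vector of shape (2 * dim,): a
    recency-weighted view concatenated with a saliency-weighted view.
    """
    n, _ = visit_embeddings.shape
    # Recency view: exponential decay, the most recent visit weighted highest.
    recency = decay ** np.arange(n - 1, -1, -1, dtype=float)
    recency /= recency.sum()
    recency_vec = recency @ visit_embeddings
    # Global view: softmax over embedding norms as a stand-in for
    # identifying globally informative visits.
    norms = np.linalg.norm(visit_embeddings, axis=1)
    saliency = np.exp(norms - norms.max())
    saliency /= saliency.sum()
    saliency_vec = saliency @ visit_embeddings
    # Concatenating both views yields a fixed-dimensional patient
    # representation regardless of the number of visits.
    return np.concatenate([recency_vec, saliency_vec])
```

Because the output dimension depends only on the embedding size, patients with different visit counts map to vectors of the same length, which is what makes the representation usable by standard downstream classifiers.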