[2603.01557] Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
Computer Science > Artificial Intelligence

arXiv:2603.01557 (cs) [Submitted on 2 Mar 2026]

Title: Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring

Authors: Aditya Shukla, Yining Yuan, Ben Tamo, Yifei Wang, Micky Nnamdi, Shaun Tan, Jieru Li, Benoit Marteau, Brad Willingham, May Wang

Abstract: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. We benchmark three approaches: zero-shot prompting, statistical prompting, and a...
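The pipeline the abstract describes (threshold-based event derivation with a persistence criterion, then event-level recall and hallucination scoring) can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's code: the threshold values, the two-day persistence window, and the exact-match alignment between reference and mentioned events are all assumptions.

```python
# Illustrative sketch of the event-based evaluation described in the abstract.
# A daily "event" is a span of readings that crosses a rule-based abnormal
# threshold and persists for a minimum number of consecutive days.
# Thresholds and the persistence window below are assumed, not the paper's.

ABNORMAL_THRESHOLDS = {"heart_rate": (50, 100)}  # (low, high) bounds, assumed
MIN_PERSISTENCE_DAYS = 2                         # assumed persistence criterion

def derive_events(daily_values, modality,
                  thresholds=ABNORMAL_THRESHOLDS, min_days=MIN_PERSISTENCE_DAYS):
    """Return (start_day, end_day) spans of sustained abnormality."""
    low, high = thresholds[modality]
    abnormal = [v < low or v > high for v in daily_values]
    events, start = [], None
    for day, flag in enumerate(abnormal):
        if flag and start is None:
            start = day                      # abnormal run begins
        elif not flag and start is not None:
            if day - start >= min_days:      # keep only sustained runs
                events.append((start, day - 1))
            start = None
    if start is not None and len(abnormal) - start >= min_days:
        events.append((start, len(abnormal) - 1))
    return events

def abnormality_recall(reference_events, mentioned_events):
    """Fraction of derived reference events the summary mentions."""
    if not reference_events:
        return 1.0
    hits = sum(1 for e in reference_events if e in mentioned_events)
    return hits / len(reference_events)

def hallucination_rate(reference_events, mentioned_events):
    """Fraction of mentioned events with no matching reference event."""
    if not mentioned_events:
        return 0.0
    extra = sum(1 for e in mentioned_events if e not in reference_events)
    return extra / len(mentioned_events)
```

For example, a week of heart-rate readings `[72, 110, 115, 112, 80, 45, 90]` yields one sustained event (days 1-3); the single-day dip on day 5 fails the persistence criterion. A summary that mentions that event plus a spurious one scores perfect recall but a nonzero hallucination rate. Duration recall and measurement coverage would follow the same alignment pattern over event lengths and modalities.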