[2601.19922] HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Summary
The paper introduces HEART, a benchmark for evaluating emotional support dialogue in humans and LLMs, focusing on empathy and communication skills.
Why It Matters
HEART addresses the gap in assessing emotional support capabilities of LLMs compared to humans. By providing a standardized framework, it enhances understanding of AI's role in supportive conversations, which is crucial as AI systems increasingly interact with users in sensitive contexts.
Key Takeaways
- HEART is the first framework to directly compare human and LLM responses on the same multi-turn emotional-support dialogues.
- It evaluates interactions based on five dimensions of interpersonal communication.
- Several LLMs show empathy comparable to humans', while humans excel at nuanced emotional responses.
- The study reveals a convergence in assessment criteria between human and LLM evaluators.
- HEART provides a foundation for future research on emotional competence in AI.
Computer Science > Computation and Language
arXiv:2601.19922 (cs)
[Submitted on 9 Jan 2026 (v1), last revised 25 Feb 2026 (this version, v2)]
Title: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Authors: Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee
Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceive...