[2510.24178] MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Computer Science > Computation and Language
arXiv:2510.24178 (cs)
[Submitted on 28 Oct 2025 (v1), last revised 4 Mar 2026 (this version, v2)]

Title: MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Authors: Aaron Scott, Maike Züfle, Jan Niehues

Abstract: Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in cur...