[2602.20805] Assessing the Impact of Speaker Identity in Speech Spoofing Detection
Summary
This paper investigates the influence of speaker identity on speech spoofing detection systems, proposing a framework that integrates speaker recognition and spoofing detection to enhance accuracy.
Why It Matters
As speech technology advances, ensuring robust spoofing detection is critical for security applications. This research challenges existing assumptions about speaker identity's role, potentially leading to improved detection systems that are less vulnerable to spoofing attacks.
Key Takeaways
- The study reveals that speaker identity significantly impacts spoofing detection performance.
- A new Speaker-Invariant Multi-Task (SInMT) framework is proposed, with one variant that models speaker identity within the embeddings and another that removes it.
- The speaker-invariant model reduces the average equal error rate by 17% compared to the baseline.
- The research highlights the importance of integrating multi-task learning in detection systems.
- Results indicate up to a 48% reduction in error rates for the most challenging spoofing attacks (e.g., A11).
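The takeaways are stated in terms of equal error rate (EER), the standard metric for spoofing detection: the operating point where the false-acceptance and false-rejection rates coincide. As background only (this helper is illustrative and not from the paper), a minimal threshold-sweep sketch of computing EER from raw detection scores:

```python
def eer(bona_scores, spoof_scores):
    """Approximate equal error rate by sweeping a decision threshold.

    Scores are assumed to be higher for bona fide speech. At each
    candidate threshold we measure the false-acceptance rate (spoofed
    clips scoring at or above it) and the false-rejection rate (bona
    fide clips scoring below it); the EER sits where the two curves
    cross, approximated here as the smallest max(FAR, FRR).
    """
    best = 1.0
    for t in sorted(bona_scores + spoof_scores):
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bona_scores) / len(bona_scores)
        best = min(best, max(far, frr))
    return best

# Perfectly separated scores give an EER of 0
print(eer([0.9, 0.8], [0.1, 0.2]))  # -> 0.0
```

A "17% reduction in average EER" thus means the crossing point of these two error curves, averaged over evaluation sets, drops by 17% relative to the baseline.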
Computer Science > Sound — arXiv:2602.20805 (cs)
[Submitted on 24 Feb 2026]
Title: Assessing the Impact of Speaker Identity in Speech Spoofing Detection
Authors: Anh-Tuan Dao, Driss Matrouf, Nicholas Evans
Abstract: Spoofing detection systems are typically trained using diverse recordings from multiple speakers, often assuming that the resulting embeddings are independent of speaker identity. However, this assumption remains unverified. In this paper, we investigate the impact of speaker information on spoofing detection systems. We propose two approaches within our Speaker-Invariant Multi-Task framework, one that models speaker identity within the embeddings and another that removes it. SInMT integrates multi-task learning for joint speaker recognition and spoofing detection, incorporating a gradient reversal layer. Evaluated using four datasets, our speaker-invariant model reduces the average equal error rate by 17% compared to the baseline, with up to 48% reduction for the most challenging attacks (e.g., A11).
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Cite as: arXiv:2602.20805 [cs.SD], https://doi.org/10.48550/arXiv.2602.20805
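The abstract's speaker-invariant variant hinges on a gradient reversal layer (GRL): the speaker-classification head is trained normally, but the gradient it sends back into the shared encoder is sign-flipped, pushing the shared embedding to discard speaker information. A minimal framework-free sketch of the GRL mechanism, assuming the paper's standard formulation (identity forward pass, negated and scaled backward pass); the function names and toy values here are illustrative, not from the paper:

```python
def grl_forward(x):
    """Forward pass of a gradient reversal layer: the identity."""
    return x

def grl_backward(upstream_grad, lam=1.0):
    """Backward pass: flip the sign of the incoming gradient and
    scale it by lambda, so the shared encoder is optimized *against*
    the speaker classifier's objective."""
    return [-lam * g for g in upstream_grad]

# Toy check: embeddings pass through unchanged, gradients are reversed
emb = [0.5, -1.2, 3.0]            # shared anti-spoofing embedding
assert grl_forward(emb) == emb     # forward is the identity
g = [0.1, 0.2, -0.3]               # gradient from the speaker head
print(grl_backward(g))             # -> [-0.1, -0.2, 0.3]
```

In a real autograd framework this would be a custom operation rather than two free functions, but the contract is the same: features flow forward untouched while the adversarial gradient steers the encoder toward speaker invariance.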