[2602.18899] [b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

Summary

This article explores how self-supervised speech models (S3Ms) encode phonological information, revealing linear relationships in their representation space that correspond to phonological features across 96 languages.

Why It Matters

Knowing how phonological information is organized inside S3Ms makes these models easier to interpret: if features such as voicing correspond to linear directions, their representations can be analyzed and manipulated in phonologically meaningful terms. This is relevant to speech recognition and speech generation research, and to linguists who use these models to study speech.

Key Takeaways

  • S3Ms encode rich phonetic information; the study analyzes how that information is structured across 96 languages.
  • Linear directions in the model's representation space correspond to phonological features.
  • The scale of a phonological vector correlates continuously with the degree of acoustic realization of its feature.
  • The study demonstrates phonological vector arithmetic: the difference between [d] and [t] yields a voicing vector, and adding it to [p] produces [b].
  • Findings could lead to improved applications in speech recognition and natural language processing.

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2602.18899 (eess) [Submitted on 21 Feb 2026]

Title: [b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

Authors: Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen

Abstract: Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code ...
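
The arithmetic described in the abstract can be illustrated with a small toy sketch. This is not the authors' code and uses no real S3M: the embeddings below are synthetic vectors built as "place of articulation + voicing + small noise", which is an assumption made purely for illustration. The names `emb`, `nearest_phone`, `voicing_vec`, and the choice of cosine similarity are likewise hypothetical. The sketch shows two things: that [d] - [t] + [p] lands nearest to [b], and that scaling the voicing vector moves [p] monotonically along the voicing direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-ins for phone representations (NOT real model outputs).
# Assumption: each phone = place vector + (voiced ? voicing direction : 0) + noise.
place = {name: rng.normal(size=dim) for name in ("labial", "alveolar", "velar")}
voicing_dir = rng.normal(size=dim)

phones = {
    "p": ("labial", 0), "b": ("labial", 1),
    "t": ("alveolar", 0), "d": ("alveolar", 1),
    "k": ("velar", 0), "g": ("velar", 1),
}
emb = {
    ph: place[pl] + voiced * voicing_dir + 0.01 * rng.normal(size=dim)
    for ph, (pl, voiced) in phones.items()
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest_phone(query, exclude=()):
    """Return the phone whose embedding is most cosine-similar to `query`."""
    candidates = {ph: v for ph, v in emb.items() if ph not in exclude}
    return max(candidates, key=lambda ph: cosine(query, candidates[ph]))


# Voicing vector estimated from one minimal pair: [d] - [t].
voicing_vec = emb["d"] - emb["t"]

# Analogy: [d] - [t] + [p]; the input phones are excluded from the search.
result = nearest_phone(emb["p"] + voicing_vec, exclude=("d", "t", "p"))
print("[d] - [t] + [p] is nearest to:", result)

# Scaling: project [p] + s * voicing_vec onto the unit voicing direction.
unit = voicing_vec / np.linalg.norm(voicing_vec)
scales = [0.0, 0.5, 1.0, 1.5]
projections = [float((emb["p"] + s * voicing_vec) @ unit) for s in scales]
for s, proj in zip(scales, projections):
    print(f"scale={s:.1f}  projection on voicing direction={proj:.3f}")
```

In this toy setup the nearest neighbor is [b] by construction, so the sketch only demonstrates the mechanics of the test. Whether real S3M representations behave this way is the empirical question the paper addresses.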
