[2601.06932] Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Summary
The paper presents Symphonym, a neural embedding system for cross-script name matching. It maps names from any script into a unified 128-dimensional phonetic space, so names can be compared directly without runtime phonetic conversion.
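To make "direct similarity comparison" concrete, here is a minimal sketch of ranking candidate names by cosine similarity between embedding vectors. The choice of cosine similarity, the function names, and the toy 4-dimensional vectors are all assumptions made for illustration; the source does not state which similarity measure Symphonym uses, and the vectors are not real Symphonym output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings (4-d here instead of 128-d).
query = [0.9, 0.1, 0.0, 0.4]
candidates = {
    "candidate_a": [0.8, 0.2, 0.1, 0.5],  # points in nearly the same direction
    "candidate_b": [0.0, 1.0, 0.3, 0.0],  # points in a very different direction
}

ranking = rank_candidates(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the embeddings would be computed once per name, matching at query time reduces to vector comparisons like this one, with no phonetic transcription step in the loop.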
Why It Matters
Cross-script name matching is central to digital humanities and geographic information retrieval, where the same name must be linked across historical sources, languages, and writing systems. Existing approaches either require language-specific phonetic algorithms or fail to capture phonetic relationships across scripts. Symphonym addresses both limitations by comparing names directly in a shared embedding space.
Key Takeaways
- Symphonym maps names into a unified 128-dimensional phonetic space.
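- Uses a Teacher-Student architecture: a Teacher network trained on articulatory phonetic features produces target embeddings, and a Student network learns to approximate them directly from characters.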
- Evaluated on the MEHDIE Hebrew-Arabic historical toponym benchmark, where it is reported to outperform traditional methods.
- Demonstrates effective application in historical name retrieval.
- Facilitates efficient workflows in digital humanities.
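The Teacher-Student idea in the takeaways, together with the triplet samples mentioned in the abstract, can be illustrated with two toy objectives. The sketch below assumes a mean-squared-error term for pulling a Student embedding toward its Teacher target and a standard triplet margin loss over Euclidean distance. The source does not state the exact loss functions, so both choices, the function names, and the margin value are illustrative assumptions rather than the paper's method.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (an assumed, hypothetical value).
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


def distillation_mse(student_vec, teacher_vec):
    """Mean squared error between a Student embedding and its Teacher target."""
    if len(student_vec) != len(teacher_vec):
        raise ValueError("vectors must have the same dimension")
    squared = [(s - t) ** 2 for s, t in zip(student_vec, teacher_vec)]
    return sum(squared) / len(squared)


# Hypothetical 2-d vectors for illustration only.
anchor = [0.0, 0.0]
matching_name = [0.1, 0.0]    # same place, different script
unrelated_name = [1.0, 0.0]   # different place

# Positive is already much closer than the negative, so the loss is zero.
easy = triplet_margin_loss(anchor, matching_name, unrelated_name)

# Swapping the roles violates the margin and produces a positive loss.
hard = triplet_margin_loss(anchor, unrelated_name, matching_name)

# Student output compared with the Teacher's target embedding.
mse = distillation_mse([1.0, 2.0], [1.0, 4.0])
```

In a setup like this, the triplet term shapes the geometry of the space (matching names close, non-matching names far), while the distillation term lets a character-level model reproduce that geometry without needing phonetic features at inference time.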
arXiv:2601.06932 (cs), Computer Science > Computation and Language
Submitted on 11 Jan 2026 (v1), last revised 19 Feb 2026 (this version, v2)
Authors: Stephen Gadd
Abstract
Linking names across historical sources, languages, and writing systems remains a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches require language-specific phonetic algorithms or fail to capture phonetic relationships across different scripts. This paper presents Symphonym, a neural embedding system that maps names from any script into a unified 128-dimensional phonetic space, enabling direct similarity comparison without runtime phonetic conversion. Symphonym uses a Teacher-Student architecture where a Teacher network trained on articulatory phonetic features produces target embeddings, while a Student network learns to approximate these embeddings directly from characters. The Teacher combines Epitran (extended with 100 new language-script mappings), Phonikud for Hebrew, and CharsiuG2P for Chinese, Japanese, and Korean. Training used 32.7 million triplet samples of toponyms spanning 20 writing systems from GeoNames, Wikidata, and Getty Thesaurus of Geographic Names. On the MEHDIE Hebrew-Arabic historical toponym benchmark, Symphonym achieves Rec...