Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Domain Knowledge and Language Embeddings for Low-Resource Multilingual Phoneme ASR. / Legchenko, Anton; Bondarenko, Ivan.
Speech and Computer. ed. / Alexey Karpov; Gábor Gosztolya. Springer, 2026. p. 130-143 10 (Lecture Notes in Computer Science; Vol. 16188 LNCS).
TY - GEN
T1 - Domain Knowledge and Language Embeddings for Low-Resource Multilingual Phoneme ASR
AU - Legchenko, Anton
AU - Bondarenko, Ivan
N1 - Conference code: 27
PY - 2026
Y1 - 2026
N2 - This paper presents methods to improve the accuracy and robustness of multilingual automatic speech recognition (ASR) systems transcribing speech into International Phonetic Alphabet (IPA) sequences. The development of such systems faces considerable challenges, including linguistic diversity, pronunciation variability, and especially the scarcity of high-quality annotated resources for many languages, which hinders model generalization to unseen linguistic domains. We propose a framework that explicitly integrates prior linguistic knowledge into the model training process and leverages auxiliary information via hierarchical multi-task learning (HMTL). The method decomposes phoneme recognition into several levels of abstraction, thus enabling the model to capture both language-independent and language-specific phonetic patterns. Furthermore, we introduce and compare two types of language vector representations, obtained respectively from acoustic signals and from phonetic transcriptions, and evaluate their utility as auxiliary inputs, particularly for low-resource and zero-shot scenarios. Experiments were conducted on multilingual corpora with both high- and low-resource languages, employing a pre-trained Wav2Vec 2.0 architecture as the base model. Baseline models were fine-tuned using Connectionist Temporal Classification (CTC) loss without auxiliary information. Phoneme Error Rate (PER) was used for evaluation in both in-domain and out-of-domain settings. The results demonstrate a relative improvement in recognition accuracy of 7–10% for most scenarios, and an improvement exceeding 20% for out-of-domain languages under reduced training data conditions.
AB - This paper presents methods to improve the accuracy and robustness of multilingual automatic speech recognition (ASR) systems transcribing speech into International Phonetic Alphabet (IPA) sequences. The development of such systems faces considerable challenges, including linguistic diversity, pronunciation variability, and especially the scarcity of high-quality annotated resources for many languages, which hinders model generalization to unseen linguistic domains. We propose a framework that explicitly integrates prior linguistic knowledge into the model training process and leverages auxiliary information via hierarchical multi-task learning (HMTL). The method decomposes phoneme recognition into several levels of abstraction, thus enabling the model to capture both language-independent and language-specific phonetic patterns. Furthermore, we introduce and compare two types of language vector representations, obtained respectively from acoustic signals and from phonetic transcriptions, and evaluate their utility as auxiliary inputs, particularly for low-resource and zero-shot scenarios. Experiments were conducted on multilingual corpora with both high- and low-resource languages, employing a pre-trained Wav2Vec 2.0 architecture as the base model. Baseline models were fine-tuned using Connectionist Temporal Classification (CTC) loss without auxiliary information. Phoneme Error Rate (PER) was used for evaluation in both in-domain and out-of-domain settings. The results demonstrate a relative improvement in recognition accuracy of 7–10% for most scenarios, and an improvement exceeding 20% for out-of-domain languages under reduced training data conditions.
KW - Hierarchical multi-task learning
KW - IPA transcription
KW - Language embeddings
KW - Multilingual ASR
KW - Phoneme recognition
KW - Speech recognition
UR - https://www.scopus.com/pages/publications/105020259101
UR - https://www.mendeley.com/catalogue/937c9674-58bd-3bf9-b0d8-73bf3cc3cd1e/
U2 - 10.1007/978-3-032-07959-6_10
DO - 10.1007/978-3-032-07959-6_10
M3 - Conference contribution
SN - 978-3-032-07958-9
T3 - Lecture Notes in Computer Science
SP - 130
EP - 143
BT - Speech and Computer
A2 - Karpov, Alexey
A2 - Gosztolya, Gábor
PB - Springer
T2 - 27th International Conference Speech and Computer 2025
Y2 - 13 October 2025 through 15 October 2025
ER -
ID: 71987636
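
The abstract above describes the training setup only at a high level: a pre-trained Wav2Vec 2.0 encoder fine-tuned with CTC loss over IPA phoneme targets, hierarchical multi-task learning across levels of abstraction, auxiliary language embeddings, and evaluation by Phoneme Error Rate. The sketch below illustrates that general recipe and is not the authors' implementation: the encoder checkpoint name, the broad phonetic-class auxiliary task, the learned language-embedding table, the vocabulary sizes, and the loss weight aux_weight are all assumptions introduced here for concreteness, since the paper's actual HMTL decomposition and embedding extraction are not specified in this record.

```python
# Illustrative sketch only. The encoder checkpoint, the broad-class auxiliary
# task, the learned language-embedding table, all sizes, and aux_weight are
# hypothetical; the paper's actual HMTL design is not given in this record.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class HMTLPhonemeRecognizer(nn.Module):
    """Wav2Vec 2.0 encoder with two CTC heads at different levels of
    abstraction: fine-grained IPA phonemes and broad phonetic classes."""

    def __init__(self, n_phonemes, n_classes, n_languages=100, lang_emb_dim=32,
                 encoder_name="facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Auxiliary language representation: a learned per-language vector,
        # concatenated to every encoder frame (a stand-in for the acoustic- or
        # transcription-derived embeddings compared in the paper).
        self.lang_emb = nn.Embedding(n_languages, lang_emb_dim)
        # Index 0 is reserved for the CTC blank in both output vocabularies.
        self.phoneme_head = nn.Linear(hidden + lang_emb_dim, n_phonemes + 1)
        self.class_head = nn.Linear(hidden + lang_emb_dim, n_classes + 1)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, waveforms, lang_ids):
        # waveforms: (B, samples) of 16 kHz audio, assumed equal length here.
        frames = self.encoder(waveforms).last_hidden_state          # (B, T, H)
        lang = self.lang_emb(lang_ids).unsqueeze(1).expand(-1, frames.size(1), -1)
        feats = torch.cat([frames, lang], dim=-1)
        return self.phoneme_head(feats), self.class_head(feats)

    def loss(self, waveforms, lang_ids, phon_targets, phon_lens,
             class_targets, class_lens, aux_weight=0.3):
        # Targets are padded (B, S) index tensors with labels starting at 1.
        phon_logits, class_logits = self.forward(waveforms, lang_ids)
        batch, n_frames = phon_logits.size(0), phon_logits.size(1)
        in_lens = torch.full((batch,), n_frames, dtype=torch.long)
        main = self.ctc(phon_logits.log_softmax(-1).transpose(0, 1),
                        phon_targets, in_lens, phon_lens)
        aux = self.ctc(class_logits.log_softmax(-1).transpose(0, 1),
                       class_targets, in_lens, class_lens)
        return main + aux_weight * aux


def phoneme_error_rate(reference, hypothesis):
    """PER = Levenshtein distance between phoneme sequences / reference length."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]))
    return d[len(reference)][len(hypothesis)] / max(1, len(reference))
```

Both heads share the encoder, so the auxiliary broad-class CTC term acts as a regularizer on the phoneme head; in a real run the CTC input lengths would be derived from the per-utterance waveform lengths rather than the padded frame count, and the reported PER would be averaged over a held-out in-domain or out-of-domain set.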