Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Domain Knowledge and Language Embeddings for Low-Resource Multilingual Phoneme ASR. / Legchenko, Anton; Bondarenko, Ivan.
Speech and Computer. ed. / Alexey Karpov; Gábor Gosztolya. Springer, 2026. p. 130-143 10 (Lecture Notes in Computer Science; Vol. 16188 LNCS).
TY - GEN
T1 - Domain Knowledge and Language Embeddings for Low-Resource Multilingual Phoneme ASR
AU - Legchenko, Anton
AU - Bondarenko, Ivan
N1 - Conference code: 27
PY - 2026
Y1 - 2026
N2 - This paper presents methods to improve the accuracy and robustness of multilingual automatic speech recognition (ASR) systems transcribing speech into International Phonetic Alphabet (IPA) sequences. The development of such systems faces considerable challenges, including linguistic diversity, pronunciation variability, and especially the scarcity of high-quality annotated resources for many languages, which hinders model generalization to unseen linguistic domains. We propose a framework that explicitly integrates prior linguistic knowledge into the model training process and leverages auxiliary information via hierarchical multi-task learning (HMTL). The method decomposes phoneme recognition into several levels of abstraction, thus enabling the model to capture both language-independent and language-specific phonetic patterns. Furthermore, we introduce and compare two types of language vector representations, obtained respectively from acoustic signals and from phonetic transcriptions, and evaluate their utility as auxiliary inputs, particularly for low-resource and zero-shot scenarios. Experiments were conducted on multilingual corpora with both high- and low-resource languages, employing a pre-trained Wav2Vec 2.0 architecture as the base model. Baseline models were fine-tuned using Connectionist Temporal Classification (CTC) loss without auxiliary information. Phoneme Error Rate (PER) was used for evaluation in both in-domain and out-of-domain settings. The results demonstrate a relative improvement in recognition accuracy of 7–10% for most scenarios, and an improvement exceeding 20% for out-of-domain languages under reduced training data conditions.
AB - This paper presents methods to improve the accuracy and robustness of multilingual automatic speech recognition (ASR) systems transcribing speech into International Phonetic Alphabet (IPA) sequences. The development of such systems faces considerable challenges, including linguistic diversity, pronunciation variability, and especially the scarcity of high-quality annotated resources for many languages, which hinders model generalization to unseen linguistic domains. We propose a framework that explicitly integrates prior linguistic knowledge into the model training process and leverages auxiliary information via hierarchical multi-task learning (HMTL). The method decomposes phoneme recognition into several levels of abstraction, thus enabling the model to capture both language-independent and language-specific phonetic patterns. Furthermore, we introduce and compare two types of language vector representations, obtained respectively from acoustic signals and from phonetic transcriptions, and evaluate their utility as auxiliary inputs, particularly for low-resource and zero-shot scenarios. Experiments were conducted on multilingual corpora with both high- and low-resource languages, employing a pre-trained Wav2Vec 2.0 architecture as the base model. Baseline models were fine-tuned using Connectionist Temporal Classification (CTC) loss without auxiliary information. Phoneme Error Rate (PER) was used for evaluation in both in-domain and out-of-domain settings. The results demonstrate a relative improvement in recognition accuracy of 7–10% for most scenarios, and an improvement exceeding 20% for out-of-domain languages under reduced training data conditions.
KW - Hierarchical multi-task learning
KW - IPA transcription
KW - Language embeddings
KW - Multilingual ASR
KW - Phoneme recognition
KW - Speech recognition
UR - https://www.scopus.com/pages/publications/105020259101
UR - https://www.mendeley.com/catalogue/937c9674-58bd-3bf9-b0d8-73bf3cc3cd1e/
U2 - 10.1007/978-3-032-07959-6_10
DO - 10.1007/978-3-032-07959-6_10
M3 - Conference contribution
SN - 978-3-032-07958-9
T3 - Lecture Notes in Computer Science
SP - 130
EP - 143
BT - Speech and Computer
A2 - Karpov, Alexey
A2 - Gosztolya, Gábor
PB - Springer
T2 - 27th International Conference Speech and Computer 2025
Y2 - 13 October 2025 through 15 October 2025
ER -
ID: 71987636
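
The abstract above describes the training setup only at a high level: a pre-trained Wav2Vec 2.0 encoder fine-tuned with CTC loss over IPA phoneme targets, hierarchical multi-task learning across levels of abstraction, auxiliary language embeddings, and evaluation by Phoneme Error Rate. The sketch below illustrates that general recipe and is not the authors' implementation: the encoder checkpoint name, the broad phonetic-class auxiliary task, the learned language-embedding table, the vocabulary sizes, and the loss weight aux_weight are all assumptions introduced here for concreteness, since the paper's actual HMTL decomposition and embedding extraction are not specified in this record.

```python
# Illustrative sketch only. The encoder checkpoint, the broad-class auxiliary
# task, the learned language-embedding table, all sizes, and aux_weight are
# hypothetical; the paper's actual HMTL design is not given in this record.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class HMTLPhonemeRecognizer(nn.Module):
    """Wav2Vec 2.0 encoder with two CTC heads at different levels of
    abstraction: fine-grained IPA phonemes and broad phonetic classes."""

    def __init__(self, n_phonemes, n_classes, n_languages=100, lang_emb_dim=32,
                 encoder_name="facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Auxiliary language representation: a learned per-language vector,
        # concatenated to every encoder frame (a stand-in for the acoustic- or
        # transcription-derived embeddings compared in the paper).
        self.lang_emb = nn.Embedding(n_languages, lang_emb_dim)
        # Index 0 is reserved for the CTC blank in both output vocabularies.
        self.phoneme_head = nn.Linear(hidden + lang_emb_dim, n_phonemes + 1)
        self.class_head = nn.Linear(hidden + lang_emb_dim, n_classes + 1)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, waveforms, lang_ids):
        # waveforms: (B, samples) of 16 kHz audio, assumed equal length here.
        frames = self.encoder(waveforms).last_hidden_state          # (B, T, H)
        lang = self.lang_emb(lang_ids).unsqueeze(1).expand(-1, frames.size(1), -1)
        feats = torch.cat([frames, lang], dim=-1)
        return self.phoneme_head(feats), self.class_head(feats)

    def loss(self, waveforms, lang_ids, phon_targets, phon_lens,
             class_targets, class_lens, aux_weight=0.3):
        # Targets are padded (B, S) index tensors with labels starting at 1.
        phon_logits, class_logits = self.forward(waveforms, lang_ids)
        batch, n_frames = phon_logits.size(0), phon_logits.size(1)
        in_lens = torch.full((batch,), n_frames, dtype=torch.long)
        main = self.ctc(phon_logits.log_softmax(-1).transpose(0, 1),
                        phon_targets, in_lens, phon_lens)
        aux = self.ctc(class_logits.log_softmax(-1).transpose(0, 1),
                       class_targets, in_lens, class_lens)
        return main + aux_weight * aux


def phoneme_error_rate(reference, hypothesis):
    """PER = Levenshtein distance between phoneme sequences / reference length."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]))
    return d[len(reference)][len(hypothesis)] / max(1, len(reference))
```

Both heads share the encoder, so the auxiliary broad-class CTC term acts as a regularizer on the phoneme head; in a real run the CTC input lengths would be derived from the per-utterance waveform lengths rather than the padded frame count, and the reported PER would be averaged over a held-out in-domain or out-of-domain set.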