Standard

Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts? / Morozov, Dmitry; Garipov, Timur; Lyashevskaya, Olga et al.

In: Journal of Language and Education, Vol. 10, No. 4, 30.12.2024, p. 71-84.

Research output: Contribution to journal › Article › peer-review

Harvard

Morozov, D, Garipov, T, Lyashevskaya, O, Savchuk, S, Iomdin, B & Glazkova, A 2024, 'Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?', Journal of Language and Education, vol. 10, no. 4, pp. 71-84. https://doi.org/10.17323/jle.2024.22237, https://doi.org/10.17323/jle.2024.v10.i4

APA

Morozov, D., Garipov, T., Lyashevskaya, O., Savchuk, S., Iomdin, B., & Glazkova, A. (2024). Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts? Journal of Language and Education, 10(4), 71-84. https://doi.org/10.17323/jle.2024.22237, https://doi.org/10.17323/jle.2024.v10.i4

Vancouver

Morozov D, Garipov T, Lyashevskaya O, Savchuk S, Iomdin B, Glazkova A. Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts? Journal of Language and Education. 2024 Dec 30;10(4):71-84. doi: 10.17323/jle.2024.22237, 10.17323/jle.2024.v10.i4

Author

Morozov, Dmitry ; Garipov, Timur ; Lyashevskaya, Olga et al. / Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts? In: Journal of Language and Education. 2024 ; Vol. 10, No. 4. pp. 71-84.

BibTeX

@article{e58f509fcd5d4af7a631d3595316c7d7,
title = "Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?",
abstract = "Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies in the morpheme dictionaries. Thus, it remains uncertain whether any algorithm can be used to automatically expand the existing morpheme dictionaries. Purpose: To compare various existing algorithms of morpheme segmentation for the Russian language and analyze their applicability in the task of automatic augmentation of various existing morpheme dictionaries. Results: In this study, we compared several state-of-the-art machine learning algorithms using three datasets structured around different segmentation paradigms. Two experiments were carried out, each employing five-fold cross-validation. In the first experiment, we randomly partitioned the dataset into five subsets. In the second, we grouped all words sharing the same root into a single subset, excluding words that contained multiple roots. During cross-validation, models were trained on four of these subsets and evaluated on the remaining one. Across both experiments, the algorithms that relied on ensembles of convolutional neural networks consistently demonstrated the highest performance. However, we observed a notable decline in accuracy when testing on words containing unfamiliar roots. We also found that, on a randomly selected set of words, the performance of these algorithms was comparable to that of human experts. Conclusion: Our results indicate that although automatic methods have, on average, reached a quality close to expert level, the lack of semantic consideration makes it impossible to use them for automatic dictionary expansion without expert validation. 
The conducted research revealed that further research should be aimed at addressing the key identified issues: poor performance with unknown roots and acronyms. At the same time, when a small number of unfamiliar roots can be assumed in the test dataset, an ensemble of convolutional neural networks should be utilized. The presented results can be used in the development of morpheme-oriented tokenizers and systems for analyzing the complexity of texts.",
author = "Dmitry Morozov and Timur Garipov and Olga Lyashevskaya and Svetlana Savchuk and Boris Iomdin and Anna Glazkova",
year = "2024",
month = dec,
day = "30",
doi = "10.17323/jle.2024.22237",
language = "English",
volume = "10",
pages = "71--84",
journal = "Journal of Language and Education",
issn = "2411-7390",
publisher = "Higher School of Economics, National Research University",
number = "4",

}

RIS

TY - JOUR

T1 - Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

AU - Morozov, Dmitry

AU - Garipov, Timur

AU - Lyashevskaya, Olga

AU - Savchuk, Svetlana

AU - Iomdin, Boris

AU - Glazkova, Anna

PY - 2024/12/30

Y1 - 2024/12/30

N2 - Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies in the morpheme dictionaries. Thus, it remains uncertain whether any algorithm can be used to automatically expand the existing morpheme dictionaries. Purpose: To compare various existing algorithms of morpheme segmentation for the Russian language and analyze their applicability in the task of automatic augmentation of various existing morpheme dictionaries. Results: In this study, we compared several state-of-the-art machine learning algorithms using three datasets structured around different segmentation paradigms. Two experiments were carried out, each employing five-fold cross-validation. In the first experiment, we randomly partitioned the dataset into five subsets. In the second, we grouped all words sharing the same root into a single subset, excluding words that contained multiple roots. During cross-validation, models were trained on four of these subsets and evaluated on the remaining one. Across both experiments, the algorithms that relied on ensembles of convolutional neural networks consistently demonstrated the highest performance. However, we observed a notable decline in accuracy when testing on words containing unfamiliar roots. We also found that, on a randomly selected set of words, the performance of these algorithms was comparable to that of human experts. Conclusion: Our results indicate that although automatic methods have, on average, reached a quality close to expert level, the lack of semantic consideration makes it impossible to use them for automatic dictionary expansion without expert validation. 
The conducted research revealed that further research should be aimed at addressing the key identified issues: poor performance with unknown roots and acronyms. At the same time, when a small number of unfamiliar roots can be assumed in the test dataset, an ensemble of convolutional neural networks should be utilized. The presented results can be used in the development of morpheme-oriented tokenizers and systems for analyzing the complexity of texts.

AB - Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies in the morpheme dictionaries. Thus, it remains uncertain whether any algorithm can be used to automatically expand the existing morpheme dictionaries. Purpose: To compare various existing algorithms of morpheme segmentation for the Russian language and analyze their applicability in the task of automatic augmentation of various existing morpheme dictionaries. Results: In this study, we compared several state-of-the-art machine learning algorithms using three datasets structured around different segmentation paradigms. Two experiments were carried out, each employing five-fold cross-validation. In the first experiment, we randomly partitioned the dataset into five subsets. In the second, we grouped all words sharing the same root into a single subset, excluding words that contained multiple roots. During cross-validation, models were trained on four of these subsets and evaluated on the remaining one. Across both experiments, the algorithms that relied on ensembles of convolutional neural networks consistently demonstrated the highest performance. However, we observed a notable decline in accuracy when testing on words containing unfamiliar roots. We also found that, on a randomly selected set of words, the performance of these algorithms was comparable to that of human experts. Conclusion: Our results indicate that although automatic methods have, on average, reached a quality close to expert level, the lack of semantic consideration makes it impossible to use them for automatic dictionary expansion without expert validation. 
The conducted research revealed that further research should be aimed at addressing the key identified issues: poor performance with unknown roots and acronyms. At the same time, when a small number of unfamiliar roots can be assumed in the test dataset, an ensemble of convolutional neural networks should be utilized. The presented results can be used in the development of morpheme-oriented tokenizers and systems for analyzing the complexity of texts.

UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85214864388&origin=inward&txGid=9c0769c95d840177aa12d8c60665e30c

U2 - 10.17323/jle.2024.22237

DO - 10.17323/jle.2024.22237

M3 - Article

VL - 10

SP - 71

EP - 84

JO - Journal of Language and Education

JF - Journal of Language and Education

SN - 2411-7390

IS - 4

ER -
