Research output: Chapter in Book/Report/Conference proceeding › Chapter › Research › peer-review
Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts. / Mengliev, Davlatyor; Abdurakhmonova, Nilufar; Allamov, Oybek et al.
AIP Conference Proceedings. Vol. 3356 American Institute of Physics Inc., 2025. 050002 (AIP Conference Proceedings; Vol. 3356).
TY - CHAP
T1 - Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts
AU - Mengliev, Davlatyor
AU - Abdurakhmonova, Nilufar
AU - Allamov, Oybek
AU - Ibragimov, Bahodir
AU - Saidov, Bobur
AU - Boltayev, Nodirbek
N1 - Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts / D. Mengliev, N. Abdurakhmonova, O. Allamov, B. Ibragimov, B. Saidov, N. Boltayev // AIP Conference Proceedings // American Institute of Physics Inc.: United States of America. - 2025. - P. 050002
PY - 2025/9/15
Y1 - 2025/9/15
N2 - The paper discusses the development of a hybrid algorithm for recognizing named entities in Uzbek texts of the 20th century. Automatic processing of these texts is complicated by the agglutinative nature of the Uzbek language, multiple script reforms (from Arabic to Latin, then to Cyrillic and back to Latin), as well as the presence of dialects and dialect words. Existing methods and tools developed for other languages are ineffective for the Uzbek language due to its strong agglutinative dependence. In this paper, a hybrid approach is proposed that combines rule-based algorithms and a language model based on mBERT. To train the model, a custom language corpus was created consisting of 3000 sentences marked up using the BIOES scheme. Rule-based algorithms include transliteration of the text from Cyrillic to Latin and standardization of dialect words by replacing them with formal equivalents using dictionaries and morphological analysis. The results of the experiments showed that on the test set of 20th century texts, the accuracy was 91.4%, recall was 89.6%, and F1-measure was 90.5%. On the modern set of texts (21st century), the algorithm also showed high results with an F1-measure of 86.0%, demonstrating the ability to generalize. The developed algorithm effectively solves the problems of recognizing named entities in Uzbek texts.
UR - https://www.mendeley.com/catalogue/55d62cba-47d0-3721-989c-c1fd36ebdc9f/
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105017825299&origin=inward
U2 - 10.1063/5.0296177
DO - 10.1063/5.0296177
M3 - Chapter
SN - 9780735452589
VL - 3356
T3 - AIP Conference Proceedings
BT - AIP Conference Proceedings
PB - American Institute of Physics Inc.
ER -
ID: 70630256
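
Illustrative note on the abstract: the BIOES markup scheme and the reported scores can be sketched as follows. This is a minimal sketch, not the authors' implementation. The example sentence, the PER and LOC labels, and the helper names bioes_tags and f1_score are hypothetical; the abstract does not list the label set. The check at the end assumes that the figure reported as "accuracy" (91.4%) is precision, an assumption under which the harmonic mean with the 89.6% recall reproduces the reported F1-measure of 90.5%.

```python
from typing import List, Tuple


def bioes_tags(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Encode entity spans as BIOES tags.

    Each span is (start, end, label) with an inclusive end index.
    B = begin, I = inside, E = end, S = single-token entity, O = outside.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
            tags[end] = f"E-{label}"
    return tags


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence; labels are assumed for illustration only.
tokens = ["Abdulla", "Qodiriy", "Toshkentda", "tug'ilgan"]
spans = [(0, 1, "PER"), (2, 2, "LOC")]
tags = bioes_tags(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")

# Assumption: the abstract's "accuracy" of 91.4% is precision.
f1 = f1_score(91.4, 89.6)
print(f"F1 = {f1:.1f}")
```

In a hybrid pipeline of the kind the abstract describes, tags like these would be produced after the rule-based steps (transliteration and dialect normalization) have been applied to the input text.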