Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts

Standard

Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts. / Mengliev, Davlatyor; Abdurakhmonova, Nilufar; Allamov, Oybek et al.

AIP Conference Proceedings. Vol. 3356 American Institute of Physics Inc., 2025. 050002 (AIP Conference Proceedings; Vol. 3356).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Harvard

Mengliev, D, Abdurakhmonova, N, Allamov, O, Ibragimov, B, Saidov, B & Boltayev, N 2025, Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts. in AIP Conference Proceedings. vol. 3356, 050002, AIP Conference Proceedings, vol. 3356, American Institute of Physics Inc. https://doi.org/10.1063/5.0296177

APA

Mengliev, D., Abdurakhmonova, N., Allamov, O., Ibragimov, B., Saidov, B., & Boltayev, N. (2025). Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts. In AIP Conference Proceedings (Vol. 3356). [050002] (AIP Conference Proceedings; Vol. 3356). American Institute of Physics Inc. https://doi.org/10.1063/5.0296177

Vancouver

Mengliev D, Abdurakhmonova N, Allamov O, Ibragimov B, Saidov B, Boltayev N. Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts. In AIP Conference Proceedings. Vol. 3356. American Institute of Physics Inc. 2025. 050002. (AIP Conference Proceedings). doi: 10.1063/5.0296177

Author

Mengliev, Davlatyor ; Abdurakhmonova, Nilufar ; Allamov, Oybek et al. / Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts. AIP Conference Proceedings. Vol. 3356 American Institute of Physics Inc., 2025. (AIP Conference Proceedings).

BibTeX

@inproceedings{7386871c762b45439a72219ac6d0e981,

title = "Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts",

abstract = "The paper discusses the development of a hybrid algorithm for recognizing named entities in Uzbek texts of the 20th century. Automatic processing of these texts is complicated by the agglutinative nature of the Uzbek language, multiple script reforms (from Arabic to Latin, then to Cyrillic and back to Latin), as well as the presence of dialects and dialect words. Existing methods and tools developed for other languages are ineffective for the Uzbek language due to its strong agglutinative dependence. In this paper, a hybrid approach is proposed that combines rule-based algorithms and a language model based on mBERT. To train the model, a custom language corpus was created consisting of 3000 sentences marked up using the BIOES scheme. Rule-based algorithms include transliteration of the text from Cyrillic to Latin and standardization of dialect words by replacing them with formal equivalents using dictionaries and morphological analysis. The results of the experiments showed that on the test set of 20th century texts, the precision was 91.4%, recall was 89.6%, and F1-measure was 90.5%. On the modern set of texts (21st century), the algorithm also showed high results with an F1-measure of 86.0%, demonstrating the ability to generalize. The developed algorithm effectively solves the problems of recognizing named entities in Uzbek texts.",

author = "Davlatyor Mengliev and Nilufar Abdurakhmonova and Oybek Allamov and Bahodir Ibragimov and Bobur Saidov and Nodirbek Boltayev",

note = "Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts / D. Mengliev, N. Abdurakhmonova, O. Allamov, B. Ibragimov, B. Saidov, N. Boltayev // AIP Conference Proceedings // American Institute of Physics Inc.: United States. - 2025. - P. 050002",

year = "2025",

month = sep,

day = "15",

doi = "10.1063/5.0296177",

language = "English",

isbn = "9780735452589",

volume = "3356",

series = "AIP Conference Proceedings",

publisher = "American Institute of Physics Inc.",

booktitle = "AIP Conference Proceedings",

address = "United States",

}

RIS

TY - GEN

T1 - Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts

AU - Mengliev, Davlatyor

AU - Abdurakhmonova, Nilufar

AU - Allamov, Oybek

AU - Ibragimov, Bahodir

AU - Saidov, Bobur

AU - Boltayev, Nodirbek

N1 - Development of a hybrid algorithm for identifying named entities in 20th century Uzbek texts / D. Mengliev, N. Abdurakhmonova, O. Allamov, B. Ibragimov, B. Saidov, N. Boltayev // AIP Conference Proceedings // American Institute of Physics Inc.: United States. - 2025. - P. 050002

PY - 2025/9/15

Y1 - 2025/9/15

N2 - The paper discusses the development of a hybrid algorithm for recognizing named entities in Uzbek texts of the 20th century. Automatic processing of these texts is complicated by the agglutinative nature of the Uzbek language, multiple script reforms (from Arabic to Latin, then to Cyrillic and back to Latin), as well as the presence of dialects and dialect words. Existing methods and tools developed for other languages are ineffective for the Uzbek language due to its strong agglutinative dependence. In this paper, a hybrid approach is proposed that combines rule-based algorithms and a language model based on mBERT. To train the model, a custom language corpus was created consisting of 3000 sentences marked up using the BIOES scheme. Rule-based algorithms include transliteration of the text from Cyrillic to Latin and standardization of dialect words by replacing them with formal equivalents using dictionaries and morphological analysis. The results of the experiments showed that on the test set of 20th century texts, the precision was 91.4%, recall was 89.6%, and F1-measure was 90.5%. On the modern set of texts (21st century), the algorithm also showed high results with an F1-measure of 86.0%, demonstrating the ability to generalize. The developed algorithm effectively solves the problems of recognizing named entities in Uzbek texts.

AB - The paper discusses the development of a hybrid algorithm for recognizing named entities in Uzbek texts of the 20th century. Automatic processing of these texts is complicated by the agglutinative nature of the Uzbek language, multiple script reforms (from Arabic to Latin, then to Cyrillic and back to Latin), as well as the presence of dialects and dialect words. Existing methods and tools developed for other languages are ineffective for the Uzbek language due to its strong agglutinative dependence. In this paper, a hybrid approach is proposed that combines rule-based algorithms and a language model based on mBERT. To train the model, a custom language corpus was created consisting of 3000 sentences marked up using the BIOES scheme. Rule-based algorithms include transliteration of the text from Cyrillic to Latin and standardization of dialect words by replacing them with formal equivalents using dictionaries and morphological analysis. The results of the experiments showed that on the test set of 20th century texts, the precision was 91.4%, recall was 89.6%, and F1-measure was 90.5%. On the modern set of texts (21st century), the algorithm also showed high results with an F1-measure of 86.0%, demonstrating the ability to generalize. The developed algorithm effectively solves the problems of recognizing named entities in Uzbek texts.

UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105017825299&origin=inward

UR - https://www.mendeley.com/catalogue/55d62cba-47d0-3721-989c-c1fd36ebdc9f/

U2 - 10.1063/5.0296177

DO - 10.1063/5.0296177

M3 - Conference contribution

SN - 9780735452589

VL - 3356

T3 - AIP Conference Proceedings

BT - AIP Conference Proceedings

PB - American Institute of Physics Inc.

ER -

ID: 70630256