Research output: Contribution to journal › Article › peer-review
Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation. / Mengliev, Davlatyor; Barakhnin, Vladimir; Abdurakhmonova, Nilufar et al.
In: Data in Brief, Vol. 54, 110413, 06.2024.
TY - JOUR
T1 - Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation
AU - Mengliev, Davlatyor
AU - Barakhnin, Vladimir
AU - Abdurakhmonova, Nilufar
AU - Eshkulov, Mukhriddin
N1 - © 2024 The Author(s).
PY - 2024/6
Y1 - 2024/6
N2 - This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a resource-constrained language environment. Despite the increase in NLP applications, the Uzbek language is still underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. In addition, for practical application and experiments, the authors have developed two algorithms that, using this dictionary, identify named entities in Uzbek-language texts. The authors also describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in the Uzbek language, but also offers a methodological basis for the use of vocabulary-based NER or machine learning NER in other low-resource languages (e.g. Karakalpak). The dataset and algorithms we have developed can be used to create applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, contributing to the development of these areas in the regions for which such solutions are built.
AB - This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a resource-constrained language environment. Despite the increase in NLP applications, the Uzbek language is still underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. In addition, for practical application and experiments, the authors have developed two algorithms that, using this dictionary, identify named entities in Uzbek-language texts. The authors also describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in the Uzbek language, but also offers a methodological basis for the use of vocabulary-based NER or machine learning NER in other low-resource languages (e.g. Karakalpak). The dataset and algorithms we have developed can be used to create applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, contributing to the development of these areas in the regions for which such solutions are built.
KW - Language corpus
KW - Linguistic research
KW - Low-resource languages
KW - Named entity
KW - Uzbek language
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85191354558&origin=inward&txGid=1285c934aa772d2c8267bc3093408e03
UR - https://www.mendeley.com/catalogue/10442773-0240-31fa-94aa-1ae3c805f3f0/
U2 - 10.1016/j.dib.2024.110413
DO - 10.1016/j.dib.2024.110413
M3 - Article
C2 - 38708296
VL - 54
JO - Data in Brief
JF - Data in Brief
SN - 2352-3409
M1 - 110413
ER -
ID: 60559375