Research output: Contribution to journal › Article › peer-review
A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language. / Mengliev, Davlatyor; Barakhnin, Vladimir; Eshkulov, Mukhriddin et al.
In: Data in Brief, Vol. 58, 111249, 02.2025.
TY - JOUR
T1 - A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language
AU - Mengliev, Davlatyor
AU - Barakhnin, Vladimir
AU - Eshkulov, Mukhriddin
AU - Ibragimov, Bahodir
AU - Madirimov, Shohrux
PY - 2025/2
Y1 - 2025/2
AB - This study presents a dataset for named entity recognition (NER) in the Uzbek language. The dataset comprises 2000 sentences and 25,865 words drawn from legal documents and hand-crafted sentences, annotated using the BIOES scheme. The authors demonstrate its use by training a language model with a CNN + LSTM architecture, which achieves an F1 score of 90.8%, precision of 93.9%, and recall of 88.0% on the test set. The proposed dataset and trained model contribute to the development of natural language processing for the Uzbek language. The authors also review existing work and provide a comparative analysis that helps identify the distinctive features and novelty of the proposed work. In conclusion, they outline possible directions for further development: scaling the dataset and applying other neural network architectures.
KW - Language corpus
KW - Linguistic research
KW - Low-resource languages
KW - Named entity
KW - Uzbek language
UR - https://www.mendeley.com/catalogue/1fed1e13-8662-3a68-9940-47092327cb98/
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85212919315&origin=inward&txGid=f09c5273c340bee9dd44b982acb56057
U2 - 10.1016/j.dib.2024.111249
DO - 10.1016/j.dib.2024.111249
M3 - Article
C2 - 39811531
VL - 58
JO - Data in Brief
JF - Data in Brief
SN - 2352-3409
M1 - 111249
ER -
ID: 62799565