A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language

Standard

A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language. / Mengliev, Davlatyor; Barakhnin, Vladimir; Eshkulov, Mukhriddin et al.

In: Data in Brief, Vol. 58, 111249, 02.2025.

Research output: Contribution to journal › Article › Peer-reviewed

BibTeX

@article{9cf871767ecf477fbab33da4c09a5570,

title = "A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language",

abstract = "In this study, the authors presented a dataset for named entity recognition in the Uzbek language. The dataset consists of 2000 sentences and 25,865 words, and the sources were legal documents and hand-crafted sentences annotated using the BIOES scheme. The study is complemented by the fact that the authors demonstrated the applications of the created dataset by training a language model using the CNN + LSTM architecture, which achieves high accuracy in NER tasks, with an F1 score of 90.8 %, precision of 93.9 %, and recall of 88.0 % on the test set. The proposed dataset and trained model contribute to the development of natural language processing in the Uzbek language. In addition, the authors also conducted an analysis of existing works, as well as a comparative analysis, which will help to identify the distinctive features and novelty of the proposed work. Moreover, in conclusion, the authors propose possible scenarios for the development of the work, in the form of further scaling of the dataset, as well as the use of other neural network architectures.",

keywords = "Language corpus, Linguistic research, Low-resource languages, Named entity, Uzbek language",

author = "Davlatyor Mengliev and Vladimir Barakhnin and Mukhriddin Eshkulov and Bahodir Ibragimov and Shohrux Madirimov",

year = "2025",

month = feb,

doi = "10.1016/j.dib.2024.111249",

language = "English",

volume = "58",

journal = "Data in Brief",

issn = "2352-3409",

publisher = "Elsevier Science Publishing Company, Inc.",

}

RIS

TY - JOUR

T1 - A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language

AU - Mengliev, Davlatyor

AU - Barakhnin, Vladimir

AU - Eshkulov, Mukhriddin

AU - Ibragimov, Bahodir

AU - Madirimov, Shohrux

PY - 2025/2

Y1 - 2025/2

N2 - In this study, the authors presented a dataset for named entity recognition in the Uzbek language. The dataset consists of 2000 sentences and 25,865 words, and the sources were legal documents and hand-crafted sentences annotated using the BIOES scheme. The study is complemented by the fact that the authors demonstrated the applications of the created dataset by training a language model using the CNN + LSTM architecture, which achieves high accuracy in NER tasks, with an F1 score of 90.8 %, precision of 93.9 %, and recall of 88.0 % on the test set. The proposed dataset and trained model contribute to the development of natural language processing in the Uzbek language. In addition, the authors also conducted an analysis of existing works, as well as a comparative analysis, which will help to identify the distinctive features and novelty of the proposed work. Moreover, in conclusion, the authors propose possible scenarios for the development of the work, in the form of further scaling of the dataset, as well as the use of other neural network architectures.

AB - In this study, the authors presented a dataset for named entity recognition in the Uzbek language. The dataset consists of 2000 sentences and 25,865 words, and the sources were legal documents and hand-crafted sentences annotated using the BIOES scheme. The study is complemented by the fact that the authors demonstrated the applications of the created dataset by training a language model using the CNN + LSTM architecture, which achieves high accuracy in NER tasks, with an F1 score of 90.8 %, precision of 93.9 %, and recall of 88.0 % on the test set. The proposed dataset and trained model contribute to the development of natural language processing in the Uzbek language. In addition, the authors also conducted an analysis of existing works, as well as a comparative analysis, which will help to identify the distinctive features and novelty of the proposed work. Moreover, in conclusion, the authors propose possible scenarios for the development of the work, in the form of further scaling of the dataset, as well as the use of other neural network architectures.

KW - Language corpus

KW - Linguistic research

KW - Low-resource languages

KW - Named entity

KW - Uzbek language

UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85212919315&origin=inward&txGid=f09c5273c340bee9dd44b982acb56057

UR - https://www.mendeley.com/catalogue/1fed1e13-8662-3a68-9940-47092327cb98/

U2 - 10.1016/j.dib.2024.111249

DO - 10.1016/j.dib.2024.111249

M3 - Article

C2 - 39811531

VL - 58

JO - Data in Brief

JF - Data in Brief

SN - 2352-3409

M1 - 111249

ER -

ID: 62799565