Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation

Standard

Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation. / Mengliev, Davlatyor; Barakhnin, Vladimir; Abdurakhmonova, Nilufar et al.

In: Data in Brief, Vol. 54, 110413, 06.2024.

Research output: Contribution to journal › Article › peer-review

BibTeX

@article{e31de10aa12d4698a6a57d97add2ea7a,

title = "Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation",

abstract = "This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, in a resource-constrained language environment. Despite the increase in NLP applications, the Uzbek language is still underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. In addition, for practical application and experiments, the authors have developed two algorithms that, using this dictionary, identify named entities in Uzbek language texts. The authors also describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in the Uzbek language, but also offers a methodological basis for the use of vocabulary-based NER or machine learning NER in other low-resource languages (e.g. Karakalpak). The dataset (and algorithms) we have developed can be used to create applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, contributing to the development of those areas in the region for which these solutions will be developed.",

keywords = "Language corpus, Linguistic research, Low-resource languages, Named entity, Uzbek language",

author = "Davlatyor Mengliev and Vladimir Barakhnin and Nilufar Abdurakhmonova and Mukhriddin Eshkulov",

note = "{\textcopyright} 2024 The Author(s).",

year = "2024",

month = jun,

doi = "10.1016/j.dib.2024.110413",

language = "English",

volume = "54",

pages = "110413",

journal = "Data in Brief",

issn = "2352-3409",

publisher = "Elsevier Science Publishing Company, Inc.",

}

RIS

TY - JOUR

T1 - Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation

AU - Mengliev, Davlatyor

AU - Barakhnin, Vladimir

AU - Abdurakhmonova, Nilufar

AU - Eshkulov, Mukhriddin

PY - 2024/6

Y1 - 2024/6

N2 - This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, in a resource-constrained language environment. Despite the increase in NLP applications, the Uzbek language is still underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. In addition, for practical application and experiments, the authors have developed two algorithms that, using this dictionary, identify named entities in Uzbek language texts. The authors also describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in the Uzbek language, but also offers a methodological basis for the use of vocabulary-based NER or machine learning NER in other low-resource languages (e.g. Karakalpak). The dataset (and algorithms) we have developed can be used to create applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, contributing to the development of those areas in the region for which these solutions will be developed.

AB - This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, in a resource-constrained language environment. Despite the increase in NLP applications, the Uzbek language is still underrepresented, which underscores the importance of our work. Our dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. In addition, for practical application and experiments, the authors have developed two algorithms that, using this dictionary, identify named entities in Uzbek language texts. The authors also describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in the Uzbek language, but also offers a methodological basis for the use of vocabulary-based NER or machine learning NER in other low-resource languages (e.g. Karakalpak). The dataset (and algorithms) we have developed can be used to create applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, contributing to the development of those areas in the region for which these solutions will be developed.

KW - Language corpus

KW - Linguistic research

KW - Low-resource languages

KW - Named entity

KW - Uzbek language

UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85191354558&origin=inward&txGid=1285c934aa772d2c8267bc3093408e03

UR - https://www.mendeley.com/catalogue/10442773-0240-31fa-94aa-1ae3c805f3f0/

U2 - 10.1016/j.dib.2024.110413

DO - 10.1016/j.dib.2024.110413

M3 - Article

C2 - 38708296

VL - 54

JO - Data in Brief

JF - Data in Brief

SN - 2352-3409

M1 - 110413

ER -
