Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Linguistic Nuances in Text Analysis: TF-IDF Metric's Algorithm Implementation for the Karakalpak Language Recognition. / Mengliev, Davlatyor; Eshkulov, Mukhriddin; Barakhnin, Vladimir et al.
Proceedings - 2024 IEEE Ural-Siberian Conference on Biomedical Engineering, Radioelectronics and Information Technology, USBEREIT 2024. Institute of Electrical and Electronics Engineers Inc., 2024. p. 19-22 (Proceedings - 2024 IEEE Ural-Siberian Conference on Biomedical Engineering, Radioelectronics and Information Technology, USBEREIT 2024).
TY - GEN
T1 - Linguistic Nuances in Text Analysis: TF-IDF Metric's Algorithm Implementation for the Karakalpak Language Recognition
AU - Mengliev, Davlatyor
AU - Eshkulov, Mukhriddin
AU - Barakhnin, Vladimir
AU - Abdullayev, Ruslan
AU - Boltayev, Nodirbek
AU - Ibragimov, Bahodir
PY - 2024
Y1 - 2024
AB - This article discusses an original approach to calculating the TF-IDF metric for Karakalpak language documents. The paper reviews related work, including efforts to automatically extract stop words and apply the TF-IDF metric tailored to the linguistic characteristics of the Karakalpak language, highlighting the importance of morphological preprocessing to improve the accuracy and efficiency of algorithms. Despite the challenges associated with the agglutinative nature of the Karakalpak language, such as the need for extensive morphological pre-processing to accurately identify and analyze word forms, this study proposes a new algorithm that demonstrates significant potential in dealing with the complexity of the language. By carefully adapting the TF-IDF metric to account for the morphological structure of Karakalpak, the proposed algorithm marks a significant advance in the computational analysis of agglutinative languages. Testing of the algorithm was thorough and included a diverse set of words unique to each dialect, as well as words common to multiple dialects and misspelled words. The algorithm has demonstrated high accuracy in identifying dialect-specific words and processing records in mixed dialects. In addition, this study contributes to the broader field of Turkic languages by offering insights into the structural and lexical features of the Uzbek language.
KW - Agglutinative languages
KW - Karakalpak language
KW - TF-IDF
KW - compound word formation
KW - morphological structure
KW - natural language processing
KW - noun cases
KW - suffixation
KW - verb conjugation
KW - vowel harmony
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85199189313&origin=inward&txGid=b967a67803f95a2d6c82dc5b0761de33
UR - https://www.mendeley.com/catalogue/80e227ce-0e75-3502-8f76-6bac8ee32b4a/
U2 - 10.1109/USBEREIT61901.2024.10584051
DO - 10.1109/USBEREIT61901.2024.10584051
M3 - Conference contribution
SN - 9798350362893
T3 - Proceedings - 2024 IEEE Ural-Siberian Conference on Biomedical Engineering, Radioelectronics and Information Technology, USBEREIT 2024
SP - 19
EP - 22
BT - Proceedings - 2024 IEEE Ural-Siberian Conference on Biomedical Engineering, Radioelectronics and Information Technology, USBEREIT 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE Ural-Siberian Conference on Biomedical Engineering, Radioelectronics and Information Technology
Y2 - 13 May 2024 through 15 May 2024
ER -