Standard

The design of the structure of the software system for processing text document corpus. / Barakhnin, Vladimir B.; Kozhemyakina, Olga Yu; Mukhamediev, Ravil I. et al.

In: Business Informatics, Vol. 13, No. 4, 2019, p. 60-72.

Research output: Contribution to journal › Article › peer-review

Harvard

Barakhnin, VB, Kozhemyakina, OY, Mukhamediev, RI, Borzilova, YS & Yakunin, KO 2019, 'The design of the structure of the software system for processing text document corpus', Business Informatics, vol. 13, no. 4, pp. 60-72. https://doi.org/10.17323/1998-0663.2019.4.60.72

APA

Barakhnin, V. B., Kozhemyakina, O. Y., Mukhamediev, R. I., Borzilova, Y. S., & Yakunin, K. O. (2019). The design of the structure of the software system for processing text document corpus. Business Informatics, 13(4), 60-72. https://doi.org/10.17323/1998-0663.2019.4.60.72

Vancouver

Barakhnin VB, Kozhemyakina OY, Mukhamediev RI, Borzilova YS, Yakunin KO. The design of the structure of the software system for processing text document corpus. Business Informatics. 2019;13(4):60-72. doi: 10.17323/1998-0663.2019.4.60.72

Author

Barakhnin, Vladimir B. ; Kozhemyakina, Olga Yu ; Mukhamediev, Ravil I. et al. / The design of the structure of the software system for processing text document corpus. In: Business Informatics. 2019 ; Vol. 13, No. 4. pp. 60-72.

BibTeX

@article{5122cd94774d419ab3ffa398d03d232e,
title = "The design of the structure of the software system for processing text document corpus",
abstract = "One of the most difficult tasks in the field of data mining is the development of universal tools for the analysis of texts written in the literary and business styles. A popular path in the development of algorithms for processing a text document corpus is the use of machine learning methods that allow one to solve NLP (natural language processing) tasks. The basis for research in the field of natural language processing is to be found in the following factors: the specificity of the structure of literary and business style texts (which requires the formation of separate datasets and, in the case of machine learning methods, additional feature selection) and the lack of complete systems for mass processing of text documents for the Russian language (in relation to the scientific community; in the commercial environment there are some systems of smaller scale which solve highly specialized tasks, for example, determining the tonality of a text). The aim of the current study is to design and further develop the structure of a text document corpus processing system. The design took into account the requirements for large-scale systems: modularity, the ability to scale components, and the conditional independence of components. The system we designed is a set of components, each of which is built and used in the form of Docker containers. The levels of the system are: the data processing level, the data storage level, and the visualization and management of the results of data processing (the visualization and management level). At the data processing level, text documents (for example, news items) are collected (scraped) and further processed using an ensemble of machine learning methods, each of which is implemented in the system as a separate Airflow task. The results are placed for storage in a relational database; ElasticSearch is used to increase the speed of data search (more than 1 million units). The visualization of statistics obtained as a result of the algorithms is carried out using the Plotly plugin. The administration and viewing of processed texts are available through a web interface using the Django framework. The general scheme of the interaction of components is organized on the principle of ETL (extract, transform, load). Currently the system is used to analyze a corpus of news texts in order to identify information of a destructive nature. In the future, we expect to improve the system and to publish the components in the open GitHub repository for access by the scientific community.",
keywords = "Development of a text corpus processing system, Natural language processing, Streaming word processing, Text analysis information system",
author = "Barakhnin, {Vladimir B.} and Kozhemyakina, {Olga Yu} and Mukhamediev, {Ravil I.} and Borzilova, {Yulia S.} and Yakunin, {Kirill O.}",
note = "Funding Information: This work was funded by grant No. BR05236839 of the Ministry of Education and Science of the Republic of Kazakhstan, by the Russian Foundation for Basic Research, project No. 19-31-27001, and within the framework of the state task theme No. АААА-А17-117120670141-7 (No. 0316-2018-0009).",
year = "2019",
doi = "10.17323/1998-0663.2019.4.60.72",
language = "English",
volume = "13",
pages = "60--72",
journal = "Business Informatics",
issn = "2587-814X",
publisher = "National Research University Higher School of Economics",
number = "4",
}
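The abstract describes the system's general scheme as an ETL pipeline: documents are scraped, transformed by an ensemble of methods (each a separate Airflow task in the real system), and loaded into a relational store, with ElasticSearch accelerating search. As a minimal, hypothetical sketch of that flow only, the following Python stands in plain functions for the Airflow tasks and an in-memory SQLite table for the relational database and search index; none of these names or signatures come from the paper.

```python
# Hypothetical ETL sketch of the pipeline outlined in the abstract.
import sqlite3

def extract():
    # Stand-in for the scraping step (e.g. collecting news documents).
    return [{"id": 1, "text": "sample news item"}]

def transform(doc):
    # Stand-in for the ensemble of ML methods; in the described system
    # each method would run as its own Airflow task.
    doc["tokens"] = doc["text"].split()
    return doc

def load(docs):
    # Stand-in for the relational storage level.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE docs (id INTEGER, text TEXT, n_tokens INTEGER)")
    con.executemany(
        "INSERT INTO docs VALUES (?, ?, ?)",
        [(d["id"], d["text"], len(d["tokens"])) for d in docs],
    )
    return con

con = load([transform(d) for d in extract()])
rows = con.execute("SELECT id, n_tokens FROM docs").fetchall()
print(rows)  # [(1, 3)]
```

The separation into three functions mirrors the paper's stated design constraints (modularity and conditional independence of components), which is what lets each stage be containerized and scaled independently.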

RIS

TY - JOUR

T1 - The design of the structure of the software system for processing text document corpus

AU - Barakhnin, Vladimir B.

AU - Kozhemyakina, Olga Yu

AU - Mukhamediev, Ravil I.

AU - Borzilova, Yulia S.

AU - Yakunin, Kirill O.

N1 - Funding Information: This work was funded by grant No. BR05236839 of the Ministry of Education and Science of the Republic of Kazakhstan, by the Russian Foundation for Basic Research, project No. 19-31-27001, and within the framework of the state task theme No. АААА-А17-117120670141-7 (No. 0316-2018-0009).

PY - 2019

Y1 - 2019

N2 - One of the most difficult tasks in the field of data mining is the development of universal tools for the analysis of texts written in the literary and business styles. A popular path in the development of algorithms for processing a text document corpus is the use of machine learning methods that allow one to solve NLP (natural language processing) tasks. The basis for research in the field of natural language processing is to be found in the following factors: the specificity of the structure of literary and business style texts (which requires the formation of separate datasets and, in the case of machine learning methods, additional feature selection) and the lack of complete systems for mass processing of text documents for the Russian language (in relation to the scientific community; in the commercial environment there are some systems of smaller scale which solve highly specialized tasks, for example, determining the tonality of a text). The aim of the current study is to design and further develop the structure of a text document corpus processing system. The design took into account the requirements for large-scale systems: modularity, the ability to scale components, and the conditional independence of components. The system we designed is a set of components, each of which is built and used in the form of Docker containers. The levels of the system are: the data processing level, the data storage level, and the visualization and management of the results of data processing (the visualization and management level). At the data processing level, text documents (for example, news items) are collected (scraped) and further processed using an ensemble of machine learning methods, each of which is implemented in the system as a separate Airflow task. The results are placed for storage in a relational database; ElasticSearch is used to increase the speed of data search (more than 1 million units). The visualization of statistics obtained as a result of the algorithms is carried out using the Plotly plugin. The administration and viewing of processed texts are available through a web interface using the Django framework. The general scheme of the interaction of components is organized on the principle of ETL (extract, transform, load). Currently the system is used to analyze a corpus of news texts in order to identify information of a destructive nature. In the future, we expect to improve the system and to publish the components in the open GitHub repository for access by the scientific community.

AB - One of the most difficult tasks in the field of data mining is the development of universal tools for the analysis of texts written in the literary and business styles. A popular path in the development of algorithms for processing a text document corpus is the use of machine learning methods that allow one to solve NLP (natural language processing) tasks. The basis for research in the field of natural language processing is to be found in the following factors: the specificity of the structure of literary and business style texts (which requires the formation of separate datasets and, in the case of machine learning methods, additional feature selection) and the lack of complete systems for mass processing of text documents for the Russian language (in relation to the scientific community; in the commercial environment there are some systems of smaller scale which solve highly specialized tasks, for example, determining the tonality of a text). The aim of the current study is to design and further develop the structure of a text document corpus processing system. The design took into account the requirements for large-scale systems: modularity, the ability to scale components, and the conditional independence of components. The system we designed is a set of components, each of which is built and used in the form of Docker containers. The levels of the system are: the data processing level, the data storage level, and the visualization and management of the results of data processing (the visualization and management level). At the data processing level, text documents (for example, news items) are collected (scraped) and further processed using an ensemble of machine learning methods, each of which is implemented in the system as a separate Airflow task. The results are placed for storage in a relational database; ElasticSearch is used to increase the speed of data search (more than 1 million units). The visualization of statistics obtained as a result of the algorithms is carried out using the Plotly plugin. The administration and viewing of processed texts are available through a web interface using the Django framework. The general scheme of the interaction of components is organized on the principle of ETL (extract, transform, load). Currently the system is used to analyze a corpus of news texts in order to identify information of a destructive nature. In the future, we expect to improve the system and to publish the components in the open GitHub repository for access by the scientific community.

KW - Development of a text corpus processing system

KW - Natural language processing

KW - Streaming word processing

KW - Text analysis information system

UR - http://www.scopus.com/inward/record.url?scp=85096699051&partnerID=8YFLogxK

U2 - 10.17323/1998-0663.2019.4.60.72

DO - 10.17323/1998-0663.2019.4.60.72

M3 - Article

AN - SCOPUS:85096699051

VL - 13

SP - 60

EP - 72

JO - Business Informatics

JF - Business Informatics

SN - 2587-814X

IS - 4

ER -

ID: 27081790