Standard

Kaznewsdataset : Single country overall digital mass media publication corpus. / Yakunin, Kirill; Kalimoldayev, Maksat; Mukhamediev, Ravil I. et al.

In: Data, Vol. 6, No. 3, 31, 03.2021.

Research output: Contribution to journal › Article › peer-review

Harvard

Yakunin, K, Kalimoldayev, M, Mukhamediev, RI, Mussabayev, R, Barakhnin, V, Kuchin, Y, Murzakhmetov, S, Buldybayev, T, Ospanova, U, Yelis, M, Zhumabayev, A, Gopejenko, V, Meirambekkyzy, Z & Abdurazakov, A 2021, 'Kaznewsdataset: Single country overall digital mass media publication corpus', Data, vol. 6, no. 3, 31. https://doi.org/10.3390/data6030031

APA

Yakunin, K., Kalimoldayev, M., Mukhamediev, R. I., Mussabayev, R., Barakhnin, V., Kuchin, Y., Murzakhmetov, S., Buldybayev, T., Ospanova, U., Yelis, M., Zhumabayev, A., Gopejenko, V., Meirambekkyzy, Z., & Abdurazakov, A. (2021). Kaznewsdataset: Single country overall digital mass media publication corpus. Data, 6(3), [31]. https://doi.org/10.3390/data6030031

Vancouver

Yakunin K, Kalimoldayev M, Mukhamediev RI, Mussabayev R, Barakhnin V, Kuchin Y et al. Kaznewsdataset: Single country overall digital mass media publication corpus. Data. 2021 Mar;6(3):31. doi: 10.3390/data6030031

Author

Yakunin, Kirill ; Kalimoldayev, Maksat ; Mukhamediev, Ravil I. et al. / Kaznewsdataset : Single country overall digital mass media publication corpus. In: Data. 2021 ; Vol. 6, No. 3.

BibTeX

@article{718a0fe9cccd476eb6510e77f536f0c9,
title = "Kaznewsdataset: Single country overall digital mass media publication corpus",
abstract = "Mass media is one of the most important elements influencing the information environment of society. The mass media is not only a source of information about what is happening but is often the authority that shapes the information agenda, the boundaries, and forms of discussion on socially relevant topics. A multifaceted and, where possible, quantitative assessment of mass media performance is crucial for understanding their objectivity, tone, thematic focus, and quality. The paper presents a corpus of Kazakhstan media, which contains over 4 million publications from 36 primary sources (each of which has at least 500 publications). The corpus also includes more than 2 million texts of Russian media for comparative analysis of publication activity of the countries, as well as about 4000 sections of state policy documents. The paper briefly describes the natural language processing and multiple-criteria decision-making methods, which are the algorithmic basis of the text and mass media evaluation method, and describes the results of several research cases, such as identification of propaganda, assessment of the tone of publications, calculation of the level of socially relevant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating the social significance of news, identifying texts with propagandistic content, and evaluating the sentiment of publications using the topic model of the text corpus, since area under receiver operating characteristics curve (ROC AUC) values of 0.81, 0.73 and 0.93, respectively, were achieved on these tasks. The described cases do not exhaust the possibilities of thematic, tonal, dynamic, etc., analysis of the considered corpus of texts. The corpus will be interesting to researchers considering both multiple publications and mass media analysis, including comparative analysis and identification of common patterns inherent in the media of different countries.",
keywords = "ARTM, Computer modeling, LDA, Mass-media, Multiple-criteria decision-making (MCDM), Natural language processing, Propaganda identification, Sentiment analysis, Significant social news, Topic modeling",
author = "Kirill Yakunin and Maksat Kalimoldayev and Mukhamediev, {Ravil I.} and Rustam Mussabayev and Vladimir Barakhnin and Yan Kuchin and Sanzhar Murzakhmetov and Timur Buldybayev and Ulzhan Ospanova and Marina Yelis and Akylbek Zhumabayev and Viktors Gopejenko and Zhazirakhanym Meirambekkyzy and Alibek Abdurazakov",
note = "Funding Information: state development program documents. The corpus was used in several research cases, such as identification of propaganda, assessment of the sentiment of publications, calculation of the level of socially significant negativity, comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating the social significance of news using the topic model of the text corpus since an area under receiver operating characteristics curve (ROC AUC) score of 0.81 was achieved in the classification task, which is comparable with results obtained for the same task by applying the bidirectional encoder representations from transformers (BERT) model. The proposed method of identifying texts with propagandistic content was cross-validated on a labeled subsample of 1000 news and showed high predictive power (ROC AUC 0.73). In the task of sentiment analysis, the proposed method showed a 0.93 ROC AUC score. Despite the noted limitations, the corpus will be of interest to researchers analyzing media, including comparative analysis and identification of common patterns inherent in the media of different countries. One of the directions of further research of the corpus is the analysis of publication activity related to individual organizations, topics and events, for example, healthcare and the COVID-19 pandemic. Author Contributions: Conceptualization, R.I.M., K.Y. and R.M.; methodology, R.I.M. and R.M.; software, K.Y. and S.M.; validation, V.G., T.B., U.O. and Y.K.; formal analysis, R.I.M.; investigation, K.Y., V.B., M.Y., Y.K., T.B., U.O., A.Z., V.G., A.A. and S.M.; resources, M.K. and R.M.; data curation, K.Y., Z.M., A.A., A.Z. and S.M.; writing - original draft preparation, R.I.M. and K.Y.; writing - review and editing, M.Y., V.B. and K.Y.; visualization, R.I.M. and K.Y.; supervision, R.I.M. and R.M.; project administration, M.K. and R.M.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the Committee of Science under the Ministry of Education and Science of the Republic of Kazakhstan, grant AP08856034. Publisher Copyright: {\textcopyright} 2021 by the authors. Licensee MDPI, Basel, Switzerland. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.",
year = "2021",
month = mar,
doi = "10.3390/data6030031",
language = "English",
volume = "6",
journal = "Data",
issn = "2306-5729",
publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",
number = "3",

}

RIS

TY - JOUR

T1 - Kaznewsdataset: Single country overall digital mass media publication corpus

AU - Yakunin, Kirill

AU - Kalimoldayev, Maksat

AU - Mukhamediev, Ravil I.

AU - Mussabayev, Rustam

AU - Barakhnin, Vladimir

AU - Kuchin, Yan

AU - Murzakhmetov, Sanzhar

AU - Buldybayev, Timur

AU - Ospanova, Ulzhan

AU - Yelis, Marina

AU - Zhumabayev, Akylbek

AU - Gopejenko, Viktors

AU - Meirambekkyzy, Zhazirakhanym

AU - Abdurazakov, Alibek

N1 - Funding Information: state development program documents. The corpus was used in several research cases, such as identification of propaganda, assessment of the sentiment of publications, calculation of the level of socially significant negativity, comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating the social significance of news using the topic model of the text corpus since an area under receiver operating characteristics curve (ROC AUC) score of 0.81 was achieved in the classification task, which is comparable with results obtained for the same task by applying the bidirectional encoder representations from transformers (BERT) model. The proposed method of identifying texts with propagandistic content was cross-validated on a labeled subsample of 1000 news and showed high predictive power (ROC AUC 0.73). In the task of sentiment analysis, the proposed method showed a 0.93 ROC AUC score. Despite the noted limitations, the corpus will be of interest to researchers analyzing media, including comparative analysis and identification of common patterns inherent in the media of different countries. One of the directions of further research of the corpus is the analysis of publication activity related to individual organizations, topics and events, for example, healthcare and the COVID-19 pandemic. Author Contributions: Conceptualization, R.I.M., K.Y. and R.M.; methodology, R.I.M. and R.M.; software, K.Y. and S.M.; validation, V.G., T.B., U.O. and Y.K.; formal analysis, R.I.M.; investigation, K.Y., V.B., M.Y., Y.K., T.B., U.O., A.Z., V.G., A.A. and S.M.; resources, M.K. and R.M.; data curation, K.Y., Z.M., A.A., A.Z. and S.M.; writing - original draft preparation, R.I.M. and K.Y.; writing - review and editing, M.Y., V.B. and K.Y.; visualization, R.I.M. and K.Y.; supervision, R.I.M. and R.M.; project administration, M.K. and R.M.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the Committee of Science under the Ministry of Education and Science of the Republic of Kazakhstan, grant AP08856034. Publisher Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.

PY - 2021/3

Y1 - 2021/3

AB - Mass media is one of the most important elements influencing the information environment of society. The mass media is not only a source of information about what is happening but is often the authority that shapes the information agenda, the boundaries, and forms of discussion on socially relevant topics. A multifaceted and, where possible, quantitative assessment of mass media performance is crucial for understanding their objectivity, tone, thematic focus, and quality. The paper presents a corpus of Kazakhstan media, which contains over 4 million publications from 36 primary sources (each of which has at least 500 publications). The corpus also includes more than 2 million texts of Russian media for comparative analysis of publication activity of the countries, as well as about 4000 sections of state policy documents. The paper briefly describes the natural language processing and multiple-criteria decision-making methods, which are the algorithmic basis of the text and mass media evaluation method, and describes the results of several research cases, such as identification of propaganda, assessment of the tone of publications, calculation of the level of socially relevant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating the social significance of news, identifying texts with propagandistic content, and evaluating the sentiment of publications using the topic model of the text corpus, since area under receiver operating characteristics curve (ROC AUC) values of 0.81, 0.73 and 0.93, respectively, were achieved on these tasks. The described cases do not exhaust the possibilities of thematic, tonal, dynamic, etc., analysis of the considered corpus of texts. The corpus will be interesting to researchers considering both multiple publications and mass media analysis, including comparative analysis and identification of common patterns inherent in the media of different countries.

KW - ARTM

KW - Computer modeling

KW - LDA

KW - Mass-media

KW - Multiple-criteria decision-making (MCDM)

KW - Natural language processing

KW - Propaganda identification

KW - Sentiment analysis

KW - Significant social news

KW - Topic modeling

UR - http://www.scopus.com/inward/record.url?scp=85103321272&partnerID=8YFLogxK

U2 - 10.3390/data6030031

DO - 10.3390/data6030031

M3 - Article

AN - SCOPUS:85103321272

VL - 6

JO - Data

JF - Data

SN - 2306-5729

IS - 3

M1 - 31

ER -

ID: 28256899