Research output: Contribution to journal › Article › peer-review
Kaznewsdataset: Single country overall digital mass media publication corpus. / Yakunin, Kirill; Kalimoldayev, Maksat; Mukhamediev, Ravil I. et al.
In: Data, Vol. 6, No. 3, 31, 03.2021.
TY - JOUR
T1 - Kaznewsdataset
T2 - Single country overall digital mass media publication corpus
AU - Yakunin, Kirill
AU - Kalimoldayev, Maksat
AU - Mukhamediev, Ravil I.
AU - Mussabayev, Rustam
AU - Barakhnin, Vladimir
AU - Kuchin, Yan
AU - Murzakhmetov, Sanzhar
AU - Buldybayev, Timur
AU - Ospanova, Ulzhan
AU - Yelis, Marina
AU - Zhumabayev, Akylbek
AU - Gopejenko, Viktors
AU - Meirambekkyzy, Zhazirakhanym
AU - Abdurazakov, Alibek
N1 - Funding Information: state development program documents. The corpus was used in several research cases, such as identification of propaganda, assessment of the sentiment of publications, calculation of the level of socially significant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating the social significance of news using the topic model of the text corpus, since an area under the receiver operating characteristic curve (ROC AUC) score of 0.81 was achieved in the classification task, which is comparable with results obtained for the same task by applying the bidirectional encoder representations from transformers (BERT) model. The proposed method of identifying texts with propagandistic content was cross-validated on a labeled subsample of 1000 news items and showed high predictive power (ROC AUC 0.73). In the task of sentiment analysis, the proposed method showed a 0.93 ROC AUC score.
Despite the noted limitations, the corpus will be of interest to researchers analyzing media, including comparative analysis and identification of common patterns inherent in the media of different countries. One of the directions of further research of the corpus is the analysis of publication activity related to individual organizations, topics and events, for example, healthcare and the COVID-19 pandemic. Author Contributions: Conceptualization, R.I.M., K.Y. and R.M.; methodology, R.I.M. and R.M.; software, K.Y. and S.M.; validation, V.G., T.B., U.O. and Y.K.; formal analysis, R.I.M.; investigation, K.Y., V.B., M.Y., Y.K., T.B., U.O., A.Z., V.G., A.A. and S.M.; resources, M.K., R.M.; data curation, K.Y., Z.M., A.A., A.Z. and S.M.; writing (original draft preparation), R.I.M. and K.Y.; writing (review and editing), M.Y., V.B. and K.Y.; visualization, R.I.M. and K.Y.; supervision, R.I.M. and R.M.; project administration, M.K. and R.M.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the Committee of Science under the Ministry of Education and Science of the Republic of Kazakhstan, grant AP08856034. Publisher Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. Copyright: Copyright 2021 Elsevier B.V., All rights reserved.
PY - 2021/3
Y1 - 2021/3
AB - Mass media is one of the most important elements influencing the information environment of society. The mass media is not only a source of information about what is happening but is often the authority that shapes the information agenda and the boundaries and forms of discussion on socially relevant topics. A multifaceted and, where possible, quantitative assessment of mass media performance is crucial for understanding their objectivity, tone, thematic focus, and quality. The paper presents a corpus of Kazakhstan media, which contains over 4 million publications from 36 primary sources (each with at least 500 publications). The corpus also includes more than 2 million texts of Russian media for comparative analysis of publication activity of the countries, as well as about 4000 sections of state policy documents. The paper briefly describes the natural language processing and multiple-criteria decision-making methods, which are the algorithmic basis of the text and mass media evaluation method, and describes the results of several research cases, such as identification of propaganda, assessment of the tone of publications, calculation of the level of socially relevant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm the general possibility of evaluating the social significance of news, identifying texts with propagandistic content, and evaluating the sentiment of publications using the topic model of the text corpus, since area under the receiver operating characteristic curve (ROC AUC) values of 0.81, 0.73, and 0.93, respectively, were achieved on these tasks. The described cases do not exhaust the possibilities of thematic, tonal, dynamic, and other analysis of the considered corpus of texts. The corpus will be of interest to researchers considering both multiple publications and mass media analysis, including comparative analysis and identification of common patterns inherent in the media of different countries.
KW - ARTM
KW - Computer modeling
KW - LDA
KW - Mass-media
KW - Multiple-criteria decision-making (MCDM)
KW - Natural language processing
KW - Propaganda identification
KW - Sentiment analysis
KW - Significant social news
KW - Topic modeling
UR - http://www.scopus.com/inward/record.url?scp=85103321272&partnerID=8YFLogxK
U2 - 10.3390/data6030031
DO - 10.3390/data6030031
M3 - Article
AN - SCOPUS:85103321272
VL - 6
JO - Data
JF - Data
SN - 2306-5729
IS - 3
M1 - 31
ER -
ID: 28256899