Text complexity and linguistic features: their correlation in English and Russian

Standard

Text complexity and linguistic features: their correlation in English and Russian. / Morozov, Dmitry A.; Glazkova, Anna V.; Iomdin, Boris L.

In: Russian Journal of Linguistics, Vol. 26, No. 2, 7, 2022, p. 426-448.

Research output: Contribution to journal › Article › peer-review

Harvard

Morozov, DA, Glazkova, AV & Iomdin, BL 2022, 'Text complexity and linguistic features: their correlation in English and Russian', Russian Journal of Linguistics, vol. 26, no. 2, 7, pp. 426-448. https://doi.org/10.22363/2687-0088-30132

APA

Morozov, D. A., Glazkova, A. V., & Iomdin, B. L. (2022). Text complexity and linguistic features: their correlation in English and Russian. Russian Journal of Linguistics, 26(2), 426-448. [7]. https://doi.org/10.22363/2687-0088-30132

Vancouver

Morozov DA, Glazkova AV, Iomdin BL. Text complexity and linguistic features: their correlation in English and Russian. Russian Journal of Linguistics. 2022;26(2):426-448. 7. doi: 10.22363/2687-0088-30132

Author

Morozov, Dmitry A. ; Glazkova, Anna V. ; Iomdin, Boris L. / Text complexity and linguistic features: their correlation in English and Russian. In: Russian Journal of Linguistics. 2022 ; Vol. 26, No. 2. pp. 426-448.

BibTeX

@article{687b8cfd60ce409fa327ad1543d28393,

title = "Text complexity and linguistic features: their correlation in English and Russian",

abstract = "Text complexity assessment is a challenging task requiring various linguistic aspects to be taken into consideration. The complexity level of the text should correspond to the reader{\textquoteright}s competence. A too complicated text could be incomprehensible, whereas a too simple one could be boring. For many years, simple features were used to assess readability, e.g. average length of words and sentences or vocabulary variety. Thanks to the development of natural language processing methods, the set of text parameters used for evaluating readability has expanded significantly. In recent years, many articles have been published the authors of which investigated the contribution of various lexical, morphological, and syntactic features to the readability level. Nevertheless, as the methods and corpora are quite diverse, it may be hard to draw general conclusions as to the effectiveness of linguistic information for evaluating text complexity due to the diversity of methods and corpora. Moreover, a cross-lingual impact of different features on various datasets has not been investigated. The purpose of this study is to conduct a large-scale comparison of features of different nature. We experimentally assessed seven commonly used feature types (readability, traditional features, morphological features, punctuation, syntax frequency, and topic modeling) on six corpora for text complexity assessment in English and Russian employing four common machine learning models: logistic regression, random forest, convolutional neural network and feedforward neural network. One of the corpora, the corpus of fiction literature read by Russian school students, was constructed for the experiment using a large-scale survey to ensure the objectivity of the labeling. We showed which feature types can significantly improve the performance and analyzed their impact according to the dataset characteristics, language, and data source.",

keywords = "corpus linguistics, machine learning, neural network, text complexity",

author = "Morozov, {Dmitry A.} and Glazkova, {Anna V.} and Iomdin, {Boris L.}",

note = "Funding Information: The article was funded by RFBR, project number 19-29-14224. Publisher Copyright: {\textcopyright} Dmitry A. Morozov, Anna V. Glazkova, Boris L. Iomdin, 2022.",

year = "2022",

doi = "10.22363/2687-0088-30132",

language = "English",

volume = "26",

pages = "426--448",

journal = "Russian Journal of Linguistics",

issn = "2687-0088",

publisher = "Peoples' Friendship University of Russia (RUDN University)",

number = "2",

}

RIS

TY - JOUR

T1 - Text complexity and linguistic features: their correlation in English and Russian

AU - Morozov, Dmitry A.

AU - Glazkova, Anna V.

AU - Iomdin, Boris L.

N1 - Funding Information: The article was funded by RFBR, project number 19-29-14224. Publisher Copyright: © Dmitry A. Morozov, Anna V. Glazkova, Boris L. Iomdin, 2022.

PY - 2022

Y1 - 2022

AB - Text complexity assessment is a challenging task requiring various linguistic aspects to be taken into consideration. The complexity level of the text should correspond to the reader’s competence. A too complicated text could be incomprehensible, whereas a too simple one could be boring. For many years, simple features were used to assess readability, e.g. average length of words and sentences or vocabulary variety. Thanks to the development of natural language processing methods, the set of text parameters used for evaluating readability has expanded significantly. In recent years, many articles have been published the authors of which investigated the contribution of various lexical, morphological, and syntactic features to the readability level. Nevertheless, as the methods and corpora are quite diverse, it may be hard to draw general conclusions as to the effectiveness of linguistic information for evaluating text complexity due to the diversity of methods and corpora. Moreover, a cross-lingual impact of different features on various datasets has not been investigated. The purpose of this study is to conduct a large-scale comparison of features of different nature. We experimentally assessed seven commonly used feature types (readability, traditional features, morphological features, punctuation, syntax frequency, and topic modeling) on six corpora for text complexity assessment in English and Russian employing four common machine learning models: logistic regression, random forest, convolutional neural network and feedforward neural network. One of the corpora, the corpus of fiction literature read by Russian school students, was constructed for the experiment using a large-scale survey to ensure the objectivity of the labeling. We showed which feature types can significantly improve the performance and analyzed their impact according to the dataset characteristics, language, and data source.

KW - corpus linguistics

KW - machine learning

KW - neural network

KW - text complexity

UR - http://www.scopus.com/inward/record.url?scp=85133614658&partnerID=8YFLogxK

UR - https://www.elibrary.ru/item.asp?id=49174232

UR - https://www.mendeley.com/catalogue/2594c338-c835-3da6-aefc-f401917a986e/

U2 - 10.22363/2687-0088-30132

DO - 10.22363/2687-0088-30132

M3 - Article

AN - SCOPUS:85133614658

VL - 26

SP - 426

EP - 448

JO - Russian Journal of Linguistics

JF - Russian Journal of Linguistics

SN - 2687-0088

IS - 2

M1 - 7

ER -

ID: 36778619