Research output: Publications in periodicals › Article › Peer-reviewed
Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment. / Morozov, D. A.; Smal, I. A.; Garipov, T. A. et al.
In: Automatic Control and Computer Sciences, Vol. 59, No. 7, 12.2025, pp. 929-940.
TY - JOUR
T1 - Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment
AU - Morozov, D. A.
AU - Smal, I. A.
AU - Garipov, T. A.
AU - Glazkova, A. V.
N1 - Morozov, D.A., Smal, I.A., Garipov, T.A. et al. Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment. Aut. Control Comp. Sci. 59, 929–940 (2025). https://doi.org/10.3103/S0146411625700294
PY - 2025/12
Y1 - 2025/12
N2 - Assessing the complexity of a text is a practical problem with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Methods for constructing a feature vector for automatic text complexity assessment are quite diverse. Early approaches relied on easily calculable quantities, such as the average sentence length or the average number of syllables per word. With the development of natural language processing algorithms, the space of features in use is expanding. In this study, we examine three groups of features: (1) automatically generated keywords, (2) characteristics of morphemic word parsing, and (3) information about the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm is used to generate keywords, a convolutional neural network model is used to generate morphemic parses, and the Stanza model, trained on the SynTagRus corpus, is used to generate syntax trees. We conduct a comparison using four different machine learning algorithms and four annotated Russian-language text corpora. The corpora differ in both domain and annotation paradigm, so the results more objectively reflect the real relationship between the features and the text’s complexity. Keywords perform worse on average than topic markers obtained using latent Dirichlet allocation (LDA). In most situations, morphemic characteristics turn out to be more effective than previously described methods for assessing the lexical complexity of a text: the frequency of words and the occurrence of word-formation patterns. In most cases, an extensive set of syntactic features improves the quality of neural network models compared with the previously described set.
KW - keyword generation
KW - morpheme parsing generation
KW - syntax trees
KW - text complexity
UR - https://www.scopus.com/pages/publications/105030608383
UR - https://www.mendeley.com/catalogue/9b8c5658-794c-35c4-958f-875ce61a6dce/
U2 - 10.3103/S0146411625700294
DO - 10.3103/S0146411625700294
M3 - Article
VL - 59
SP - 929
EP - 940
JO - Automatic Control and Computer Sciences
JF - Automatic Control and Computer Sciences
SN - 1558-108X
IS - 7
ER -
ID: 75468542