Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment

Standard

Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment. / Morozov, D. A.; Smal, I. A.; Garipov, T. A. et al.

In: Automatic Control and Computer Sciences, Vol. 59, No. 7, 12.2025, p. 929-940.

Research output: Contribution to journal › Article › peer-review

Harvard

Morozov, DA, Smal, IA, Garipov, TA & Glazkova, AV 2025, 'Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment', Automatic Control and Computer Sciences, vol. 59, no. 7, pp. 929-940. https://doi.org/10.3103/S0146411625700294

APA

Morozov, D. A., Smal, I. A., Garipov, T. A., & Glazkova, A. V. (2025). Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment. Automatic Control and Computer Sciences, 59(7), 929-940. https://doi.org/10.3103/S0146411625700294

Vancouver

Morozov DA, Smal IA, Garipov TA, Glazkova AV. Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment. Automatic Control and Computer Sciences. 2025 Dec;59(7):929-940. doi: 10.3103/S0146411625700294

Author

Morozov, D. A. ; Smal, I. A. ; Garipov, T. A. et al. / Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment. In: Automatic Control and Computer Sciences. 2025 ; Vol. 59, No. 7. pp. 929-940.

BibTeX

@article{9c2159edd2ad4789b2daccdb9cfe6016,

title = "Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment",

abstract = "The task of assessing the complexity of a text is a relevant applied problem with potential application in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. The methods for generating a feature vector when automatically assessing the text{\textquoteright}s complexity are quite diverse. Early approaches relied on easily calculable quantities, such as the average length of a sentence or the average number of syllables per word. With the development of natural language processing algorithms, the space of used features is expanding. In this study, we examine three groups of features: (1) automatically generated keywords, (2) information about the features of morphemic word parsing, and (3) information about the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm is utilized to generate keywords, a convolutional neural network model is used to generate morphemic parses, and the Stanza model, trained on the SynTagRus corpus, is used to generate syntax trees. We conduct a comparison using four different machine learning algorithms and four annotated Russian-language text corpora. The corpora used differ both in the domain and annotation paradigm, due to which the results obtained more objectively reflect the real relationship between the characteristics and the text{\textquoteright}s complexity. The use of keywords performs worse on average than the use of topic markers obtained using the latent Dirichlet allocation (LDA). In most situations, morphemic characteristics turn out to be more effective than previously described methods for assessing the lexical complexity of a text: the frequency of words and the occurrence of word-formation patterns. The use of an extensive set of syntactic features allows, in most cases, improving the quality of the work of neural network models in comparison with the previously described set.",

keywords = "keyword generation, morpheme parsing generation, syntax trees, text complexity",

author = "Morozov, {D. A.} and Smal, {I. A.} and Garipov, {T. A.} and Glazkova, {A. V.}",

note = "Morozov, D.A., Smal, I.A., Garipov, T.A. et al. Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment. Aut. Control Comp. Sci. 59, 929–940 (2025). https://doi.org/10.3103/S0146411625700294",

year = "2025",

month = dec,

doi = "10.3103/S0146411625700294",

language = "English",

volume = "59",

pages = "929--940",

journal = "Automatic Control and Computer Sciences",

issn = "1558-108X",

publisher = "Allerton Press Inc.",

number = "7",

}

RIS

TY - JOUR

T1 - Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment

AU - Morozov, D. A.

AU - Smal, I. A.

AU - Garipov, T. A.

AU - Glazkova, A. V.

N1 - Morozov, D.A., Smal, I.A., Garipov, T.A. et al. Keywords, Morpheme Parsing, and Syntactic Trees: Features for Text Complexity Assessment. Aut. Control Comp. Sci. 59, 929–940 (2025). https://doi.org/10.3103/S0146411625700294

PY - 2025/12

Y1 - 2025/12

N2 - The task of assessing the complexity of a text is a relevant applied problem with potential application in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. The methods for generating a feature vector when automatically assessing the text’s complexity are quite diverse. Early approaches relied on easily calculable quantities, such as the average length of a sentence or the average number of syllables per word. With the development of natural language processing algorithms, the space of used features is expanding. In this study, we examine three groups of features: (1) automatically generated keywords, (2) information about the features of morphemic word parsing, and (3) information about the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm is utilized to generate keywords, a convolutional neural network model is used to generate morphemic parses, and the Stanza model, trained on the SynTagRus corpus, is used to generate syntax trees. We conduct a comparison using four different machine learning algorithms and four annotated Russian-language text corpora. The corpora used differ both in the domain and annotation paradigm, due to which the results obtained more objectively reflect the real relationship between the characteristics and the text’s complexity. The use of keywords performs worse on average than the use of topic markers obtained using the latent Dirichlet allocation (LDA). In most situations, morphemic characteristics turn out to be more effective than previously described methods for assessing the lexical complexity of a text: the frequency of words and the occurrence of word-formation patterns. The use of an extensive set of syntactic features allows, in most cases, improving the quality of the work of neural network models in comparison with the previously described set.

AB - The task of assessing the complexity of a text is a relevant applied problem with potential application in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. The methods for generating a feature vector when automatically assessing the text’s complexity are quite diverse. Early approaches relied on easily calculable quantities, such as the average length of a sentence or the average number of syllables per word. With the development of natural language processing algorithms, the space of used features is expanding. In this study, we examine three groups of features: (1) automatically generated keywords, (2) information about the features of morphemic word parsing, and (3) information about the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm is utilized to generate keywords, a convolutional neural network model is used to generate morphemic parses, and the Stanza model, trained on the SynTagRus corpus, is used to generate syntax trees. We conduct a comparison using four different machine learning algorithms and four annotated Russian-language text corpora. The corpora used differ both in the domain and annotation paradigm, due to which the results obtained more objectively reflect the real relationship between the characteristics and the text’s complexity. The use of keywords performs worse on average than the use of topic markers obtained using the latent Dirichlet allocation (LDA). In most situations, morphemic characteristics turn out to be more effective than previously described methods for assessing the lexical complexity of a text: the frequency of words and the occurrence of word-formation patterns. The use of an extensive set of syntactic features allows, in most cases, improving the quality of the work of neural network models in comparison with the previously described set.

KW - keyword generation

KW - morpheme parsing generation

KW - syntax trees

KW - text complexity

UR - https://www.scopus.com/pages/publications/105030608383

UR - https://www.mendeley.com/catalogue/9b8c5658-794c-35c4-958f-875ce61a6dce/

U2 - 10.3103/S0146411625700294

DO - 10.3103/S0146411625700294

M3 - Article

VL - 59

SP - 929

EP - 940

JO - Automatic Control and Computer Sciences

JF - Automatic Control and Computer Sciences

SN - 1558-108X

IS - 7

ER -
