The Amount of Data Required to Recognize a Writer’s Style Is Consistent Across Different Languages of the World

Standard

The Amount of Data Required to Recognize a Writer’s Style Is Consistent Across Different Languages of the World. / Ryabko, Boris; Savina, Nadezhda; Lulu, Yeshewas Getachew et al.

In: Entropy, Vol. 27, No. 10, 1039, 04.10.2025.

Research output: Contribution to journal › Article › peer-review

BibTeX

@article{3c87ac814a914a2893f2215c9bfbf2c2,

title = "The Amount of Data Required to Recognize a Writer{\textquoteright}s Style Is Consistent Across Different Languages of the World",

abstract = "In this paper, we apply an information-theoretic method proposed by Ryabko and Savina (therefore called the RS-method), based on the use of data compression, to recognize the individual author{\textquoteright}s style of a writer across four languages from different language groups and families. In this paper, the presented method was used to study fiction texts in Russian (East Slavic group of languages of the Indo-European language family), Amharic (South Ethiosemitic group of the Semitic language family), Chinese (Sinitic group of the Sino-Tibetan language family) and English (West Germanic language group of the Indo-European language family). It was found that the amount of data necessary for recognizing an author{\textquoteright}s style is almost the same for all four languages, i.e., the amount of data is invariant across different language groups. The results obtained are of interest to computer science, literary studies, linguistics and, in particular, computational linguistics.",

keywords = "data compression, hypothesis testing, individual author{\textquoteright}s style of the writer, information technology, information-theoretic method (RS-method), language family, language group",

author = "Boris Ryabko and Nadezhda Savina and Lulu, {Yeshewas Getachew} and Yunfei Han",

note = "The Amount of Data Required to Recognize a Writer{\textquoteright}s Style Is Consistent Across Different Languages of the World / B. Ryabko, N. Savina, Y. G. Lulu, Y. Han // Entropy. - 2025. - Vol. 27. No. 10. - P. 1039. DOI 10.3390/e27101039",

year = "2025",

month = oct,

day = "4",

doi = "10.3390/e27101039",

language = "English",

volume = "27",

journal = "Entropy",

issn = "1099-4300",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "10",

}

RIS

TY - JOUR

T1 - The Amount of Data Required to Recognize a Writer’s Style Is Consistent Across Different Languages of the World

AU - Ryabko, Boris

AU - Savina, Nadezhda

AU - Lulu, Yeshewas Getachew

AU - Han, Yunfei

N1 - The Amount of Data Required to Recognize a Writer’s Style Is Consistent Across Different Languages of the World / B. Ryabko, N. Savina, Y. G. Lulu, Y. Han // Entropy. - 2025. - Vol. 27. No. 10. - P. 1039. DOI 10.3390/e27101039

PY - 2025/10/4

Y1 - 2025/10/4

N2 - In this paper, we apply an information-theoretic method proposed by Ryabko and Savina (therefore called the RS-method), based on the use of data compression, to recognize the individual author’s style of a writer across four languages from different language groups and families. In this paper, the presented method was used to study fiction texts in Russian (East Slavic group of languages of the Indo-European language family), Amharic (South Ethiosemitic group of the Semitic language family), Chinese (Sinitic group of the Sino-Tibetan language family) and English (West Germanic language group of the Indo-European language family). It was found that the amount of data necessary for recognizing an author’s style is almost the same for all four languages, i.e., the amount of data is invariant across different language groups. The results obtained are of interest to computer science, literary studies, linguistics and, in particular, computational linguistics.

AB - In this paper, we apply an information-theoretic method proposed by Ryabko and Savina (therefore called the RS-method), based on the use of data compression, to recognize the individual author’s style of a writer across four languages from different language groups and families. In this paper, the presented method was used to study fiction texts in Russian (East Slavic group of languages of the Indo-European language family), Amharic (South Ethiosemitic group of the Semitic language family), Chinese (Sinitic group of the Sino-Tibetan language family) and English (West Germanic language group of the Indo-European language family). It was found that the amount of data necessary for recognizing an author’s style is almost the same for all four languages, i.e., the amount of data is invariant across different language groups. The results obtained are of interest to computer science, literary studies, linguistics and, in particular, computational linguistics.

KW - data compression

KW - hypothesis testing

KW - individual author’s style of the writer

KW - information technology

KW - information-theoretic method (RS-method)

KW - language family

KW - language group

UR - https://www.mendeley.com/catalogue/12f896f4-ce86-345a-a4fc-46d9e5311f70/

UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105020280241&origin=inward

U2 - 10.3390/e27101039

DO - 10.3390/e27101039

M3 - Article

C2 - 41148997

VL - 27

JO - Entropy

JF - Entropy

SN - 1099-4300

IS - 10

M1 - 1039

ER -
