Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts

Standard

Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts. / Kudryavtseva, Anna; Kovalevskii, Artyom.

In: Glottometrics, Vol. 58, 2025, p. 19-34.

Research output: Contribution to journal › Article › peer-review

BibTeX

@article{233f2977dec0468b8d29fe58d3df3fa7,

title = "Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts",

abstract = "We classify texts using relative word frequencies. The task is to distinguish human-written texts from those generated by a computer using modern algorithms. We study two essay datasets, each containing an equal number of human-written and computer-generated essays. Studying Zipf diagrams shows that the generated texts have a significantly smaller vocabulary compared to human ones. However, the relative frequency of rare words (not included in the 1000 most common) does not allow us to confidently classify the texts. As additional features, we used the relative frequencies of the four most frequent words, as well as the ratio of the number of hapax legomena to the total number of different words. This feature allows us to significantly improve the classification. Using these six features allows us to fairly confidently determine whether the text is computer-generated.",

keywords = "Large Language Model, Zipf{\textquoteright}s Law, rare words",

author = "Anna Kudryavtseva and Artyom Kovalevskii",

note = "We are grateful to an anonymous reviewer for his remarks that helped us improve the paper. The work is supported by the Program of Fundamental Scientific Research of the SB RAS, project FWNF-2022-0010. Kudryavtseva Anna, Kovalevskii Artyom Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts / Anna Kudryavtseva, Artyom Kovalevskii // Glottometrics. – 2024. – Vol. 58. – P. 19-34. – DOI 10.53482/2025_58_423",

year = "2025",

doi = "10.53482/2025_58_423",

language = "English",

volume = "58",

pages = "19--34",

journal = "Glottometrics",

issn = "2625-8226",

publisher = "International Quantitative Linguistics Association",

}

RIS

TY - JOUR

T1 - Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts

AU - Kudryavtseva, Anna

AU - Kovalevskii, Artyom

N1 - We are grateful to an anonymous reviewer for his remarks that helped us improve the paper. The work is supported by the Program of Fundamental Scientific Research of the SB RAS, project FWNF-2022-0010. Kudryavtseva Anna, Kovalevskii Artyom Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts / Anna Kudryavtseva, Artyom Kovalevskii // Glottometrics. – 2024. – Vol. 58. – P. 19-34. – DOI 10.53482/2025_58_423

PY - 2025

Y1 - 2025

AB - We classify texts using relative word frequencies. The task is to distinguish human-written texts from those generated by a computer using modern algorithms. We study two essay datasets, each containing an equal number of human-written and computer-generated essays. Studying Zipf diagrams shows that the generated texts have a significantly smaller vocabulary compared to human ones. However, the relative frequency of rare words (not included in the 1000 most common) does not allow us to confidently classify the texts. As additional features, we used the relative frequencies of the four most frequent words, as well as the ratio of the number of hapax legomena to the total number of different words. This feature allows us to significantly improve the classification. Using these six features allows us to fairly confidently determine whether the text is computer-generated.

KW - Large Language Model

KW - Zipf’s Law

KW - rare words

UR - https://www.mendeley.com/catalogue/0648edcf-baf3-31f8-9e82-7f95b1edc74f/

UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=105013211258&origin=inward

U2 - 10.53482/2025_58_423

DO - 10.53482/2025_58_423

M3 - Article

VL - 58

SP - 19

EP - 34

JO - Glottometrics

JF - Glottometrics

SN - 2625-8226

ER -
