Standard

A statistical test for correspondence of texts to the Zipf-Mandelbrot law. / Chakrabarty, A.; Chebunin, M. G.; Kovalevskii, A. P. et al.

In: Siberian Electronic Mathematical Reports, Vol. 17, 130, 2020, p. 1959-1974.

Research output: Contribution to journal › Article › peer-review

Harvard

Chakrabarty, A, Chebunin, MG, Kovalevskii, AP, Pupyshev, IM, Zakrevskaya, NS & Zhou, Q 2020, 'A statistical test for correspondence of texts to the Zipf-Mandelbrot law', Siberian Electronic Mathematical Reports, vol. 17, 130, pp. 1959-1974. https://doi.org/10.33048/semi.2020.17.132

APA

Chakrabarty, A., Chebunin, M. G., Kovalevskii, A. P., Pupyshev, I. M., Zakrevskaya, N. S., & Zhou, Q. (2020). A statistical test for correspondence of texts to the Zipf-Mandelbrot law. Siberian Electronic Mathematical Reports, 17, 1959-1974. [130]. https://doi.org/10.33048/semi.2020.17.132

Vancouver

Chakrabarty A, Chebunin MG, Kovalevskii AP, Pupyshev IM, Zakrevskaya NS, Zhou Q. A statistical test for correspondence of texts to the Zipf-Mandelbrot law. Siberian Electronic Mathematical Reports. 2020;17:1959-1974. 130. doi: 10.33048/semi.2020.17.132

Author

Chakrabarty, A. ; Chebunin, M. G. ; Kovalevskii, A. P. et al. / A statistical test for correspondence of texts to the Zipf-Mandelbrot law. In: Siberian Electronic Mathematical Reports. 2020 ; Vol. 17. pp. 1959-1974.

BibTeX

@article{eb6f44e928e1401f8e911b2358e8a39f,
title = "A statistical test for correspondence of texts to the Zipf-Mandelbrot law",
abstract = "We analyse correspondence of texts to a simple probabilistic model. The model assumes that the words are selected independently from an infinite dictionary, and the probability distribution of words corresponds to the Zipf-Mandelbrot law. We count the numbers of different words in the text sequentially and get the process of the numbers of different words. Then we estimate the Zipf-Mandelbrot law{\textquoteright}s parameters using the same sequence and construct an estimate of the expectation of the number of different words in the text. After that we subtract the corresponding values of the estimate from the sequence and normalize along the coordinate axes, obtaining a random process on a segment from 0 to 1. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on C(0,1) to a centered Gaussian process with continuous a.s. paths. We develop and implement an algorithm for calculating the probability distribution of the integral of the square of this process. We present several examples of application of the algorithm for analysis of the homogeneity of texts in English, French, Russian, and Chinese.",
keywords = "Gaussian process, weak convergence, Zipf{\textquoteright}s law",
author = "Chakrabarty, A. and Chebunin, {M. G.} and Kovalevskii, {A. P.} and Pupyshev, {I. M.} and Zakrevskaya, {N. S.} and Zhou, Q.",
note = "Funding Information: Chakrabarty, A., Chebunin, M.G., Kovalevskii, A.P., Pupyshev, I.M., Zakrevskaya, N.S., Zhou, Q., A statistical test for correspondence of texts to the Zipf-Mandelbrot law. {\textcopyright} 2020 Chakrabarty A., Chebunin M.G., Kovalevskii A.P., Pupyshev I.M., Zakrevskaya N.S., Zhou Q. The reported study was funded by RFBR and NSFC according to the research project No. 19-51-53010. Received September 28, 2020, published November 27, 2020. Funding Information: Acknowledgements. The research was supported by RFBR grant 19-51-53010. The authors would like to thank Sergey Foss and an anonymous referee for helpful and constructive comments and suggestions. Publisher Copyright: {\textcopyright} 2020 Chakrabarty A., Chebunin M.G., Kovalevskii A.P., Pupyshev I.M., Zakrevskaya N.S., Zhou Q. All Rights Reserved.",
year = "2020",
doi = "10.33048/semi.2020.17.132",
language = "English",
volume = "17",
pages = "1959--1974",
journal = "Siberian Electronic Mathematical Reports",
issn = "1813-3304",
publisher = "Sobolev Institute of Mathematics",
}

RIS

TY - JOUR

T1 - A statistical test for correspondence of texts to the Zipf-Mandelbrot law

AU - Chakrabarty, A.

AU - Chebunin, M. G.

AU - Kovalevskii, A. P.

AU - Pupyshev, I. M.

AU - Zakrevskaya, N. S.

AU - Zhou, Q.

N1 - Funding Information: Chakrabarty, A., Chebunin, M.G., Kovalevskii, A.P., Pupyshev, I.M., Zakrevskaya, N.S., Zhou, Q., A statistical test for correspondence of texts to the Zipf-Mandelbrot law. © 2020 Chakrabarty A., Chebunin M.G., Kovalevskii A.P., Pupyshev I.M., Zakrevskaya N.S., Zhou Q. The reported study was funded by RFBR and NSFC according to the research project No. 19-51-53010. Received September 28, 2020, published November 27, 2020. Funding Information: Acknowledgements. The research was supported by RFBR grant 19-51-53010. The authors would like to thank Sergey Foss and an anonymous referee for helpful and constructive comments and suggestions. Publisher Copyright: © 2020 Chakrabarty A., Chebunin M.G., Kovalevskii A.P., Pupyshev I.M., Zakrevskaya N.S., Zhou Q. All Rights Reserved.

PY - 2020

Y1 - 2020

N2 - We analyse correspondence of texts to a simple probabilistic model. The model assumes that the words are selected independently from an infinite dictionary, and the probability distribution of words corresponds to the Zipf-Mandelbrot law. We count the numbers of different words in the text sequentially and get the process of the numbers of different words. Then we estimate the Zipf-Mandelbrot law’s parameters using the same sequence and construct an estimate of the expectation of the number of different words in the text. After that we subtract the corresponding values of the estimate from the sequence and normalize along the coordinate axes, obtaining a random process on a segment from 0 to 1. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on C(0,1) to a centered Gaussian process with continuous a.s. paths. We develop and implement an algorithm for calculating the probability distribution of the integral of the square of this process. We present several examples of application of the algorithm for analysis of the homogeneity of texts in English, French, Russian, and Chinese.

AB - We analyse correspondence of texts to a simple probabilistic model. The model assumes that the words are selected independently from an infinite dictionary, and the probability distribution of words corresponds to the Zipf-Mandelbrot law. We count the numbers of different words in the text sequentially and get the process of the numbers of different words. Then we estimate the Zipf-Mandelbrot law’s parameters using the same sequence and construct an estimate of the expectation of the number of different words in the text. After that we subtract the corresponding values of the estimate from the sequence and normalize along the coordinate axes, obtaining a random process on a segment from 0 to 1. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on C(0,1) to a centered Gaussian process with continuous a.s. paths. We develop and implement an algorithm for calculating the probability distribution of the integral of the square of this process. We present several examples of application of the algorithm for analysis of the homogeneity of texts in English, French, Russian, and Chinese.

KW - Gaussian process

KW - weak convergence

KW - Zipf’s law

UR - http://www.scopus.com/inward/record.url?scp=85110828307&partnerID=8YFLogxK

UR - https://elibrary.ru/item.asp?id=44726643

U2 - 10.33048/semi.2020.17.132

DO - 10.33048/semi.2020.17.132

M3 - Article

AN - SCOPUS:85110828307

VL - 17

SP - 1959

EP - 1974

JO - Siberian Electronic Mathematical Reports

JF - Siberian Electronic Mathematical Reports

SN - 1813-3304

M1 - 130

ER -

ID: 34241386