Standard

Modifications of Karlin and Simon text models. / Chebunin, M. G.; Kovalevskii, A. P.

In: Siberian Electronic Mathematical Reports, Vol. 19, No. 2, 2022, p. 708-723.

Research output: Contribution to journal › Article › peer-review

Harvard

Chebunin, MG & Kovalevskii, AP 2022, 'Modifications of Karlin and Simon text models', Siberian Electronic Mathematical Reports, vol. 19, no. 2, pp. 708-723. https://doi.org/10.33048/semi.2022.19.059

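APA

Chebunin, M. G., & Kovalevskii, A. P. (2022). Modifications of Karlin and Simon text models. Siberian Electronic Mathematical Reports, 19(2), 708-723. https://doi.org/10.33048/semi.2022.19.059
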
Vancouver

Chebunin MG, Kovalevskii AP. Modifications of Karlin and Simon text models. Siberian Electronic Mathematical Reports. 2022;19(2):708-723. doi: 10.33048/semi.2022.19.059

Author

Chebunin, M. G. ; Kovalevskii, A. P. / Modifications of Karlin and Simon text models. In: Siberian Electronic Mathematical Reports. 2022 ; Vol. 19, No. 2. pp. 708-723.

BibTeX

@article{0ce2d4176db349abb9aa5a5e243f7f2c,
title = "Modifications of Karlin and Simon text models",
abstract = "We discuss probability text models and their modifications. We construct processes of different and unique words in a text. The models should correspond to real text statistics. The infinite urn model (Karlin model) and the Simon model are the best-known text models, but they do not simulate the number of unique words correctly. The infinite urn model sometimes gives an incorrect limit for the relative number of unique and different words. The Simon model implies linear growth of the numbers of different and unique words. We propose three modifications of the Karlin and Simon models. The first is the offline variant: the Simon model starts after the completion of the infinite urn scheme. We prove limit theorems for this modification at embedded times only. The second modification involves repeated words in the Karlin model. We prove limit theorems for it. The third modification is the online variant: the Simon redistribution works at any toss of the Karlin model. In contrast to the compound Poisson model, we have no analytical results for this modification. We test all the modifications by simulation and obtain good correspondence to real texts.",
keywords = "Infinite urn model, Probability text models, Simon model, Weak convergence",
author = "Chebunin, {M. G.} and Kovalevskii, {A. P.}",
note = "Chebunin, M. G. Modifications of Karlin and Simon text models / M. G. Chebunin, A. P. Kovalevskii // Siberian Electronic Mathematical Reports. – 2022. – Vol. 19, No. 2. – P. 708-723. The reported study was funded by RFBR and CNRS according to the research project No. 19-51-15001.",
year = "2022",
doi = "10.33048/semi.2022.19.059",
language = "English",
volume = "19",
pages = "708--723",
journal = "Siberian Electronic Mathematical Reports",
issn = "1813-3304",
publisher = "Sobolev Institute of Mathematics",
number = "2",

}

RIS

TY - JOUR

T1 - Modifications of Karlin and Simon text models

AU - Chebunin, M. G.

AU - Kovalevskii, A. P.

N1 - Chebunin, M. G. Modifications of Karlin and Simon text models / M. G. Chebunin, A. P. Kovalevskii // Siberian Electronic Mathematical Reports. – 2022. – Vol. 19, No. 2. – P. 708-723. The reported study was funded by RFBR and CNRS according to the research project No. 19-51-15001.

PY - 2022

Y1 - 2022

AB - We discuss probability text models and their modifications. We construct processes of different and unique words in a text. The models should correspond to real text statistics. The infinite urn model (Karlin model) and the Simon model are the best-known text models, but they do not simulate the number of unique words correctly. The infinite urn model sometimes gives an incorrect limit for the relative number of unique and different words. The Simon model implies linear growth of the numbers of different and unique words. We propose three modifications of the Karlin and Simon models. The first is the offline variant: the Simon model starts after the completion of the infinite urn scheme. We prove limit theorems for this modification at embedded times only. The second modification involves repeated words in the Karlin model. We prove limit theorems for it. The third modification is the online variant: the Simon redistribution works at any toss of the Karlin model. In contrast to the compound Poisson model, we have no analytical results for this modification. We test all the modifications by simulation and obtain good correspondence to real texts.

KW - Infinite urn model

KW - Probability text models

KW - Simon model

KW - Weak convergence

UR - https://www.scopus.com/inward/record.url?eid=2-s2.0-85145993030&partnerID=40&md5=955fb5220aa2a3f564289e31577966f0

UR - https://www.elibrary.ru/item.asp?id=50336845

UR - https://www.mendeley.com/catalogue/58ad6d64-48b8-3112-9f97-4fd871297e90/

U2 - 10.33048/semi.2022.19.059

DO - 10.33048/semi.2022.19.059

M3 - Article

VL - 19

SP - 708

EP - 723

JO - Siberian Electronic Mathematical Reports

JF - Siberian Electronic Mathematical Reports

SN - 1813-3304

IS - 2

ER -

ID: 45800469