Research output: Contribution to journal › Article › peer-review
Modifications of Karlin and Simon text models. / Chebunin, M. G.; Kovalevskii, A. P.
In: Siberian Electronic Mathematical Reports, Vol. 19, No. 2, 2022, p. 708-723.
TY - JOUR
T1 - Modifications of Karlin and Simon text models
AU - Chebunin, M. G.
AU - Kovalevskii, A. P.
N1 - Chebunin, M. G. Modifications of Karlin and Simon text models / M. G. Chebunin, A. P. Kovalevskii // Siberian Electronic Mathematical Reports. – 2022. – Vol. 19, No. 2. – P. 708-723. The reported study was funded by RFBR and CNRS according to the research project No. 19-51-15001.
PY - 2022
Y1 - 2022
N2 - We discuss probability text models and their modifications. We construct processes that count the different words and the unique words in a text. The models should match the statistics of real texts. The infinite urn model (Karlin model) and the Simon model are the best-known text models, but neither simulates the number of unique words correctly. The infinite urn model sometimes gives an incorrect limit for the ratio of the number of unique words to the number of different words. The Simon model implies linear growth of the numbers of different and unique words. We propose three modifications of the Karlin and Simon models. The first is an offline variant: the Simon model starts after the infinite urn scheme is complete. We prove limit theorems for this modification at embedded times only. The second modification introduces repeated words into the Karlin model. We prove limit theorems for it. The third is an online variant: the Simon redistribution acts at every toss of the Karlin model. In contrast to the compound Poisson model, we have no analytical results for this modification. We test all the modifications by simulation and obtain good agreement with real texts.
KW - Infinite urn model
KW - Probability text models
KW - Simon model
KW - Weak convergence
UR - https://www.scopus.com/inward/record.url?eid=2-s2.0-85145993030&partnerID=40&md5=955fb5220aa2a3f564289e31577966f0
UR - https://www.elibrary.ru/item.asp?id=50336845
UR - https://www.mendeley.com/catalogue/58ad6d64-48b8-3112-9f97-4fd871297e90/
U2 - 10.33048/semi.2022.19.059
DO - 10.33048/semi.2022.19.059
M3 - Article
VL - 19
SP - 708
EP - 723
JO - Siberian Electronic Mathematical Reports
JF - Siberian Electronic Mathematical Reports
SN - 1813-3304
IS - 2
ER -
ID: 45800469
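
For orientation, the two base models named in the abstract can be simulated in a few lines. The sketch below is not the authors' code and does not implement any of their three modifications; it follows the usual textbook definitions of the infinite urn scheme and the Simon model. The power-law urn probabilities, the truncation to `num_urns` urns, and the parameter names `exponent` and `p_new` are assumptions made for illustration only.

```python
import random
from collections import Counter
from itertools import accumulate


def count_different_and_unique(words):
    """Return (different, unique): distinct words and words seen exactly once."""
    freq = Counter(words)
    unique = sum(1 for c in freq.values() if c == 1)
    return len(freq), unique


def karlin_counts(n, num_urns=10_000, exponent=1.5, seed=0):
    """Infinite urn (Karlin) scheme, truncated to num_urns urns.

    Each of the n balls (word tokens) falls independently into urn i with
    probability proportional to i ** (-exponent). Both the power law and
    the truncation are illustrative assumptions.
    """
    rng = random.Random(seed)
    weights = [i ** (-exponent) for i in range(1, num_urns + 1)]
    cum_weights = list(accumulate(weights))
    words = rng.choices(range(num_urns), cum_weights=cum_weights, k=n)
    return count_different_and_unique(words)


def simon_counts(n, p_new=0.1, seed=0):
    """Simon model for a text of n >= 1 tokens.

    The first token is a new word. Each later token is a new word with
    probability p_new; otherwise it copies the word at a uniformly chosen
    earlier position, so frequent words are repeated more often.
    """
    rng = random.Random(seed)
    text = [0]
    next_word = 1
    for _ in range(n - 1):
        if rng.random() < p_new:
            text.append(next_word)
            next_word += 1
        else:
            text.append(rng.choice(text))
    return count_different_and_unique(text)


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, "Karlin:", karlin_counts(n), "Simon:", simon_counts(n))
```

In this sketch the expected number of different words in the Simon model is 1 + (n - 1) * `p_new`, which is the linear growth the abstract refers to; the urn scheme with power-law probabilities grows sublinearly instead.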