Research output: Scientific publications in periodicals › article › peer-reviewed
Hapax legomena via stochastic processes. / Файзуллаев, Шахзод Шухрат угли; Ковалевский, Артем Павлович.
In: Glottometrics, Vol. 56, 30.07.2024, pp. 22-39.
TY - JOUR
T1 - Hapax legomena via stochastic processes
AU - Файзуллаев, Шахзод Шухрат угли
AU - Ковалевский, Артем Павлович
N1 - Sobolev Institute of Mathematics SB RAS, FWNF-2022-0010
PY - 2024/7/30
Y1 - 2024/7/30
AB - We study the number of words that occur exactly once since the beginning of a text. We model it as a stochastic process over the length of the text. The elementary probability model, going back to Bahadur and Karlin, states that the number of words that occur exactly once should grow according to a power law, like the number of different words. The final value of the number of words occurring exactly once is the number of hapaxes of this text. We construct two statistical tests to test Karlin's model under the assumption that the probabilities of words in this model satisfy the generalized Zipf's law. These statistical tests show that some texts fit the model well, but many texts deviate significantly from it. This deviation is that the number of hapaxes is too small relative to the number of different words.
KW - limit theorems
KW - mathematical expectation
KW - statistical test
KW - Zipf’s law
UR - https://www.webofscience.com/wos/woscc/full-record/WOS:001274055400002
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85200776162&origin=inward&txGid=d0a0070f66785cee2d6b16ea146d22bd
U2 - https://doi.org/10.53482/2024_56_415
DO - https://doi.org/10.53482/2024_56_415
M3 - Article
VL - 56
SP - 22
EP - 39
JO - Glottometrics
JF - Glottometrics
SN - 2625-8226
ER -
ID: 61237572