Research output: Contribution to journal › Article › peer-review
Text Segmentation Via Processes that Count the Number of Different Words Forward and Backward. / Abebe, Berhane; Chebunin, Mikhail; Kovalevskii, Artyom.
In: Journal of Quantitative Linguistics, 2023, p. 1-18.
TY - JOUR
T1 - Text Segmentation Via Processes that Count the Number of Different Words Forward and Backward
AU - Abebe, Berhane
AU - Chebunin, Mikhail
AU - Kovalevskii, Artyom
N1 - The work is supported partially by the Fundamental scientific research of the SB RAS, project FWNF-2022-0010.
PY - 2023
Y1 - 2023
N2 - The paper develops a new statistical approach to automatically partitioning texts into parts written by different authors. It is based on the analysis of processes that count the number of different words forward and backward. The theoretical study of these processes is based on an elementary probability model with a change point. We prove the consistency of our statistical estimate of the concatenation point when the concatenated texts have different Zipf exponents. The method is tested on the Brown corpus and on newspaper texts in different languages. In these tests, the method gives a good estimate of the concatenation point. The method can be used in parallel with other text segmentation methods.
AB - The paper develops a new statistical approach to automatically partitioning texts into parts written by different authors. It is based on the analysis of processes that count the number of different words forward and backward. The theoretical study of these processes is based on an elementary probability model with a change point. We prove the consistency of our statistical estimate of the concatenation point when the concatenated texts have different Zipf exponents. The method is tested on the Brown corpus and on newspaper texts in different languages. In these tests, the method gives a good estimate of the concatenation point. The method can be used in parallel with other text segmentation methods.
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85176726465&origin=inward&txGid=81ee0a8d108e42aef8e36e5ecce0dd9d
UR - https://www.mendeley.com/catalogue/3cc22079-a7cf-3038-aaa1-33b068c90bde/
U2 - 10.1080/09296174.2023.2275342
DO - 10.1080/09296174.2023.2275342
M3 - Article
SP - 1
EP - 18
JO - Journal of Quantitative Linguistics
JF - Journal of Quantitative Linguistics
SN - 0929-6174
ER -
ID: 59232960