Unsupervised Tokenization Learning

Standard

Unsupervised Tokenization Learning. / Kolonin, Anton; Ramesh, Vignav.

2022. Paper presented at the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates.

Research output: Conference materials › Paper › Peer-reviewed

Harvard

Kolonin, A & Ramesh, V 2022, 'Unsupervised Tokenization Learning', Paper presented at the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 07.12.2022 - 11.12.2022. https://doi.org/10.18653/v1/2022.emnlp-main.239

APA

Kolonin, A., & Ramesh, V. (2022). Unsupervised Tokenization Learning. Paper presented at the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. https://doi.org/10.18653/v1/2022.emnlp-main.239

Vancouver

Kolonin A, Ramesh V. Unsupervised Tokenization Learning. 2022. Paper presented at the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. doi: 10.18653/v1/2022.emnlp-main.239

Author

Kolonin, Anton ; Ramesh, Vignav. / Unsupervised Tokenization Learning. Paper presented at the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. 16 p.

BibTeX

@conference{99a34c026c4d474fab8fe549fd58893a,

title = "Unsupervised Tokenization Learning",

abstract = "In the presented study, we discover that the so-called “transition freedom” metric appears superior for unsupervised tokenization purposes in comparison to statistical metrics such as mutual information and conditional probability, providing F-measure scores in range from 0.71 to 1.0 across explored multilingual corpora. We find that different languages require different offshoots of that metric (such as derivative, variance, and “peak values”) for successful tokenization. Larger training corpora do not necessarily result in better tokenization quality, while compressing the models by eliminating statistically weak evidence tends to improve performance. The proposed unsupervised tokenization technique provides quality better than or comparable to lexicon-based ones, depending on the language.",

author = "Anton Kolonin and Vignav Ramesh",

year = "2022",

doi = "10.18653/v1/2022.emnlp-main.239",

language = "English",

note = "2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 ; Conference date: 07-12-2022 Through 11-12-2022",

}

RIS

TY - CONF

T1 - Unsupervised Tokenization Learning

AU - Kolonin, Anton

AU - Ramesh, Vignav

PY - 2022

Y1 - 2022

N2 - In the presented study, we discover that the so-called “transition freedom” metric appears superior for unsupervised tokenization purposes in comparison to statistical metrics such as mutual information and conditional probability, providing F-measure scores in range from 0.71 to 1.0 across explored multilingual corpora. We find that different languages require different offshoots of that metric (such as derivative, variance, and “peak values”) for successful tokenization. Larger training corpora do not necessarily result in better tokenization quality, while compressing the models by eliminating statistically weak evidence tends to improve performance. The proposed unsupervised tokenization technique provides quality better than or comparable to lexicon-based ones, depending on the language.

AB - In the presented study, we discover that the so-called “transition freedom” metric appears superior for unsupervised tokenization purposes in comparison to statistical metrics such as mutual information and conditional probability, providing F-measure scores in range from 0.71 to 1.0 across explored multilingual corpora. We find that different languages require different offshoots of that metric (such as derivative, variance, and “peak values”) for successful tokenization. Larger training corpora do not necessarily result in better tokenization quality, while compressing the models by eliminating statistically weak evidence tends to improve performance. The proposed unsupervised tokenization technique provides quality better than or comparable to lexicon-based ones, depending on the language.

UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85149438240&origin=inward&txGid=fc05dad2329d756230392a5e18aba688

UR - https://www.mendeley.com/catalogue/1872a78e-654b-330c-9b2b-518308d21af9/

U2 - 10.18653/v1/2022.emnlp-main.239

DO - 10.18653/v1/2022.emnlp-main.239

M3 - Paper

T2 - 2022 Conference on Empirical Methods in Natural Language Processing

Y2 - 7 December 2022 through 11 December 2022

ER -
