Standard

Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization. / Kolonin, Anton.

Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023. Institute of Electrical and Electronics Engineers (IEEE), 2023. p. 1430-1435.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Harvard

Kolonin, A 2023, Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization. in Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023. Institute of Electrical and Electronics Engineers (IEEE), pp. 1430-1435, 16th IEEE International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, Novosibirsk, Russian Federation, 10.11.2023. https://doi.org/10.1109/apeie59731.2023.10347856

APA

Kolonin, A. (2023). Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization. In Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023 (pp. 1430-1435). Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/apeie59731.2023.10347856

Vancouver

Kolonin A. Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization. In Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023. Institute of Electrical and Electronics Engineers (IEEE). 2023. p. 1430-1435. doi: 10.1109/apeie59731.2023.10347856

Author

Kolonin, Anton. / Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization. Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023. Institute of Electrical and Electronics Engineers (IEEE), 2023. pp. 1430-1435

BibTeX

@inproceedings{b5f23140c5a64a319e350ca10ca55077,
title = "Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization",
abstract = "We explore the possibility of meta-learning for the language-independent unsupervised tokenization problem for English, Russian, and Chinese. We implement the meta-learning approach for automatic determination of hyper-parameters of the unsupervised tokenization model proposed in earlier works, relying on various human-independent fitness functions such as normalised anti-entropy, compression factor and cross-split F1 score, as well as additive and multiplicative composite combinations of the three metrics, testing them against the conventional F1 tokenization score. We find a fairly good correlation between the latter and the additive combination of the former three metrics for English and Russian. In case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest the possibility of robust unsupervised tokenization of low-resource and dead languages and allow us to think about human languages in terms of the evolution of efficient symbolic communication codes with different structural optimisation schemes that have evolved in different human cultures.",
author = "Anton Kolonin",
note = "We are grateful to Sergey Terekhov and Nikolay Mikhaylovskiy for valuable questions, critique, recommendations and suggestions during the course of work. The paper preprint is available as https://arxiv.org/pdf/2303.02427.pdf.; 16th IEEE International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023 ; Conference date: 10-11-2023 Through 12-11-2023",
year = "2023",
doi = "10.1109/apeie59731.2023.10347856",
language = "English",
isbn = "9798350330885",
pages = "1430--1435",
booktitle = "Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023",
publisher = "Institute of Electrical and Electronics Engineers (IEEE)",

}

RIS

TY - GEN

T1 - Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization

AU - Kolonin, Anton

N1 - Conference code: 16

PY - 2023

Y1 - 2023

N2 - We explore the possibility of meta-learning for the language-independent unsupervised tokenization problem for English, Russian, and Chinese. We implement the meta-learning approach for automatic determination of hyper-parameters of the unsupervised tokenization model proposed in earlier works, relying on various human-independent fitness functions such as normalised anti-entropy, compression factor and cross-split F1 score, as well as additive and multiplicative composite combinations of the three metrics, testing them against the conventional F1 tokenization score. We find a fairly good correlation between the latter and the additive combination of the former three metrics for English and Russian. In case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest the possibility of robust unsupervised tokenization of low-resource and dead languages and allow us to think about human languages in terms of the evolution of efficient symbolic communication codes with different structural optimisation schemes that have evolved in different human cultures.

AB - We explore the possibility of meta-learning for the language-independent unsupervised tokenization problem for English, Russian, and Chinese. We implement the meta-learning approach for automatic determination of hyper-parameters of the unsupervised tokenization model proposed in earlier works, relying on various human-independent fitness functions such as normalised anti-entropy, compression factor and cross-split F1 score, as well as additive and multiplicative composite combinations of the three metrics, testing them against the conventional F1 tokenization score. We find a fairly good correlation between the latter and the additive combination of the former three metrics for English and Russian. In case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest the possibility of robust unsupervised tokenization of low-resource and dead languages and allow us to think about human languages in terms of the evolution of efficient symbolic communication codes with different structural optimisation schemes that have evolved in different human cultures.

UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85182275819&origin=inward&txGid=c34cb0dcb7749364a60b2dd0881c55c4

UR - https://www.mendeley.com/catalogue/be1acb13-b5b1-3ad6-8143-5937fba1f494/

U2 - 10.1109/apeie59731.2023.10347856

DO - 10.1109/apeie59731.2023.10347856

M3 - Conference contribution

SN - 9798350330885

SP - 1430

EP - 1435

BT - Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023

PB - Institute of Electrical and Electronics Engineers (IEEE)

T2 - 16th IEEE International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering

Y2 - 10 November 2023 through 12 November 2023

ER -

ID: 59610388