Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization. / Kolonin, Anton.
Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023. Institute of Electrical and Electronics Engineers (IEEE), 2023. p. 1430-1435.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
TY - GEN
T1 - Self-tuning Hyper-parameters for Unsupervised Cross-lingual Tokenization
AU - Kolonin, Anton
N1 - Conference code: 16
PY - 2023
Y1 - 2023
N2 - We explore the possibility of meta-learning for the language-independent unsupervised tokenization problem for English, Russian, and Chinese. We implement the meta-learning approach for automatic determination of hyper-parameters of the unsupervised tokenization model proposed in earlier works, relying on various human-independent fitness functions such as normalised anti-entropy, compression factor and cross-split F1 score, as well as additive and multiplicative composite combinations of the three metrics, testing them against the conventional F1 tokenization score. We find a fairly good correlation between the latter and the additive combination of the former three metrics for English and Russian. In case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest the possibility of robust unsupervised tokenization of low-resource and dead languages and allow us to think about human languages in terms of the evolution of efficient symbolic communication codes with different structural optimisation schemes that have evolved in different human cultures.
AB - We explore the possibility of meta-learning for the language-independent unsupervised tokenization problem for English, Russian, and Chinese. We implement the meta-learning approach for automatic determination of hyper-parameters of the unsupervised tokenization model proposed in earlier works, relying on various human-independent fitness functions such as normalised anti-entropy, compression factor and cross-split F1 score, as well as additive and multiplicative composite combinations of the three metrics, testing them against the conventional F1 tokenization score. We find a fairly good correlation between the latter and the additive combination of the former three metrics for English and Russian. In case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest the possibility of robust unsupervised tokenization of low-resource and dead languages and allow us to think about human languages in terms of the evolution of efficient symbolic communication codes with different structural optimisation schemes that have evolved in different human cultures.
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85182275819&origin=inward&txGid=c34cb0dcb7749364a60b2dd0881c55c4
UR - https://www.mendeley.com/catalogue/be1acb13-b5b1-3ad6-8143-5937fba1f494/
U2 - 10.1109/apeie59731.2023.10347856
DO - 10.1109/apeie59731.2023.10347856
M3 - Conference contribution
SN - 9798350330885
SP - 1430
EP - 1435
BT - Proceedings of the 2023 IEEE 16th International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering, APEIE 2023
PB - Institute of Electrical and Electronics Engineers (IEEE)
T2 - 16th IEEE International Scientific and Technical Conference Actual Problems of Electronic Instrument Engineering
Y2 - 10 November 2023 through 12 November 2023
ER -
ID: 59610388