Evolution of Efficient Symbolic Communication Codes. / Kolonin, Anton.
Studies in Computational Intelligence. Springer Science and Business Media Deutschland GmbH, 2023. p. 3-12 (Studies in Computational Intelligence; Vol. 1120). Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
TY - GEN
T1 - Evolution of Efficient Symbolic Communication Codes
AU - Kolonin, Anton
N1 - We are grateful to Sergey Terekhov and Nikolay Mikhaylovskiy for valuable questions, critique, recommendations and suggestions during the course of work.
PY - 2023
Y1 - 2023
AB - The paper explores how the structure of human natural language can be seen as a product of the evolution of an inter-personal communication code that targets maximization of culture-agnostic and cross-lingual metrics such as anti-entropy, compression factor and cross-split F1 score. The exploration is done as part of a larger unsupervised language learning effort: an attempt is made to perform meta-learning in a space of hyper-parameters, maximizing the F1 score based on the “ground truth” language structure by means of maximizing the metrics mentioned above. The paper presents preliminary results of a cross-lingual word-level segmentation (tokenization) study for Russian, Chinese and English, as well as a subword segmentation (morpho-parsing) study for English. It is found that the language structure from word-level segmentation or tokenization can be seen as driven by all of these metrics, with anti-entropy being more relevant to English and Russian and compression factor more specific to Chinese. The study of subword segmentation or morpho-parsing on the English lexicon has revealed a direct association with compression factor, while, surprisingly, the same association with anti-entropy has turned out to be inverse.
KW - Communication Code
KW - Compression
KW - Cross-lingual
KW - Entropy
KW - Meta-learning
KW - Natural Language
KW - Subword Segmentation
KW - Tokenization
KW - Unsupervised Language Learning
UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85175815154&origin=inward&txGid=ff8763439a02e4b423dba92fe55cb89a
UR - https://www.mendeley.com/catalogue/f2b6373d-6628-35b8-a066-cd733dec2e03/
U2 - 10.1007/978-3-031-44865-2_1
DO - 10.1007/978-3-031-44865-2_1
M3 - Conference contribution
SN - 9783031448645
T3 - Studies in Computational Intelligence
SP - 3
EP - 12
BT - Studies in Computational Intelligence
PB - Springer Science and Business Media Deutschland GmbH
ER -
ID: 59193519