Rubic2: Ensemble Model for Russian Lemmatization

Standard

Rubic2: Ensemble Model for Russian Lemmatization. / Afanasev, Ilia; Glazkova, Anna; Lyashevskaya, Olga et al.

Proceedings of the Annual Meeting of the Association for Computational Linguistics. ed. / Wanxiang Che; Joyce Nabende; Ekaterina Shutova; Mohammad Taher Pilehvar. Association for Computational Linguistics, 2025. p. 157-170 (Proceedings of the Annual Meeting of the Association for Computational Linguistics; Vol. 1).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Harvard

Afanasev, I, Glazkova, A, Lyashevskaya, O, Morozov, D, Smal, I & Vlasova, N 2025, Rubic2: Ensemble Model for Russian Lemmatization. in W Che, J Nabende, E Shutova & MT Pilehvar (eds), Proceedings of the Annual Meeting of the Association for Computational Linguistics. Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 1, Association for Computational Linguistics, pp. 157-170, The 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 27.07.2025. https://doi.org/10.18653/v1/2025.bsnlp-1.18

APA

Afanasev, I., Glazkova, A., Lyashevskaya, O., Morozov, D., Smal, I., & Vlasova, N. (2025). Rubic2: Ensemble Model for Russian Lemmatization. In W. Che, J. Nabende, E. Shutova, & M. T. Pilehvar (Eds.), Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 157-170). (Proceedings of the Annual Meeting of the Association for Computational Linguistics; Vol. 1). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.bsnlp-1.18

Vancouver

Afanasev I, Glazkova A, Lyashevskaya O, Morozov D, Smal I, Vlasova N. Rubic2: Ensemble Model for Russian Lemmatization. In Che W, Nabende J, Shutova E, Pilehvar MT, editors, Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2025. p. 157-170. (Proceedings of the Annual Meeting of the Association for Computational Linguistics). doi: 10.18653/v1/2025.bsnlp-1.18

Author

Afanasev, Ilia ; Glazkova, Anna ; Lyashevskaya, Olga et al. / Rubic2: Ensemble Model for Russian Lemmatization. Proceedings of the Annual Meeting of the Association for Computational Linguistics. editor / Wanxiang Che ; Joyce Nabende ; Ekaterina Shutova ; Mohammad Taher Pilehvar. Association for Computational Linguistics, 2025. pp. 157-170 (Proceedings of the Annual Meeting of the Association for Computational Linguistics).

BibTeX

@inproceedings{4b6d1b4801654b7f848bd32e38f9d5a6,

title = "Rubic2: Ensemble Model for Russian Lemmatization",

abstract = "Pre-trained language models have significantly advanced natural language processing (NLP), particularly in analyzing languages with complex morphological structures. This study addresses lemmatization for the Russian language, the errors in which can critically affect the performance of information retrieval, question answering, and other tasks. We present the results of experiments on generative lemmatization using pre-trained language models. Our findings demonstrate that combining generative models with the existing solutions allows achieving performance that surpasses current results for the lemmatization of Russian. This paper also introduces Rubic2, a new ensemble approach that combines the generative BART-base model, fine-tuned on a manually annotated data set of 2.1 million tokens, with the neural model called Rubic which is currently used for morphological annotation and lemmatization in the Russian National Corpus. Extensive experiments show that Rubic2 outperforms current solutions for the lemmatization of Russian, offering superior results across various text domains and contributing to advancements in NLP applications.",

author = "Ilia Afanasev and Anna Glazkova and Olga Lyashevskaya and Dmitry Morozov and Ivan Smal and Natalia Vlasova",

note = "Ilia Afanasev, Anna Glazkova, Olga Lyashevskaya, Dmitry Morozov, Ivan Smal, and Natalia Vlasova. 2025. Rubic2: Ensemble Model for Russian Lemmatization. In Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025), pages 157–170, Vienna, Austria. Association for Computational Linguistics. The 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025; Conference date: 27-07-2025 through 01-08-2025",

year = "2025",

month = jul,

doi = "10.18653/v1/2025.bsnlp-1.18",

language = "English",

isbn = "9798891762510",

series = "Proceedings of the Annual Meeting of the Association for Computational Linguistics",

publisher = "Association for Computational Linguistics",

pages = "157--170",

editor = "Wanxiang Che and Joyce Nabende and Ekaterina Shutova and Mohammad Taher Pilehvar",

booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics",

address = "United States",

url = "https://2025.aclweb.org/",

}

RIS

TY - GEN

T1 - Rubic2: Ensemble Model for Russian Lemmatization

AU - Afanasev, Ilia

AU - Glazkova, Anna

AU - Lyashevskaya, Olga

AU - Morozov, Dmitry

AU - Smal, Ivan

AU - Vlasova, Natalia

N1 - Conference code: 63

PY - 2025/7

Y1 - 2025/7

N2 - Pre-trained language models have significantly advanced natural language processing (NLP), particularly in analyzing languages with complex morphological structures. This study addresses lemmatization for the Russian language, the errors in which can critically affect the performance of information retrieval, question answering, and other tasks. We present the results of experiments on generative lemmatization using pre-trained language models. Our findings demonstrate that combining generative models with the existing solutions allows achieving performance that surpasses current results for the lemmatization of Russian. This paper also introduces Rubic2, a new ensemble approach that combines the generative BART-base model, fine-tuned on a manually annotated data set of 2.1 million tokens, with the neural model called Rubic which is currently used for morphological annotation and lemmatization in the Russian National Corpus. Extensive experiments show that Rubic2 outperforms current solutions for the lemmatization of Russian, offering superior results across various text domains and contributing to advancements in NLP applications.

UR - https://www.mendeley.com/catalogue/0abd9ac5-ec84-3630-bf23-7f7a861f6cd1/

U2 - 10.18653/v1/2025.bsnlp-1.18

DO - 10.18653/v1/2025.bsnlp-1.18

M3 - Conference contribution

SN - 9798891762510

T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics

SP - 157

EP - 170

BT - Proceedings of the Annual Meeting of the Association for Computational Linguistics

A2 - Che, Wanxiang

A2 - Nabende, Joyce

A2 - Shutova, Ekaterina

A2 - Pilehvar, Mohammad Taher

PB - Association for Computational Linguistics

T2 - The 63rd Annual Meeting of the Association for Computational Linguistics

Y2 - 27 July 2025 through 1 August 2025

ER -
