Pisets: A Robust Speech Recognition System for Lectures and Interviews

Standard

Pisets: A Robust Speech Recognition System for Lectures and Interviews. / Bondarenko, Ivan; Grebenkin, Daniil; Sedukhin, Oleg et al.

Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025. ed. / Weizhu Chen; Yi Yang; Mohammad Kachuee; Xue-Yong Fu. Association for Computational Linguistics, 2025. p. 988-997 (Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025; Vol. 3).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Harvard

Bondarenko, I, Grebenkin, D, Sedukhin, O, Klementev, M, Derunets, R & Budneva, L 2025, Pisets: A Robust Speech Recognition System for Lectures and Interviews. in W Chen, Y Yang, M Kachuee & X-Y Fu (eds), Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025. Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025, vol. 3, Association for Computational Linguistics, pp. 988-997, 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, Albuquerque, New Mexico, United States, 29.04.2025. https://doi.org/10.18653/v1/2025.naacl-industry.74

APA

Bondarenko, I., Grebenkin, D., Sedukhin, O., Klementev, M., Derunets, R., & Budneva, L. (2025). Pisets: A Robust Speech Recognition System for Lectures and Interviews. In W. Chen, Y. Yang, M. Kachuee, & X-Y. Fu (Eds.), Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025 (pp. 988-997). (Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025; Vol. 3). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.naacl-industry.74

Vancouver

Bondarenko I, Grebenkin D, Sedukhin O, Klementev M, Derunets R, Budneva L. Pisets: A Robust Speech Recognition System for Lectures and Interviews. In Chen W, Yang Y, Kachuee M, Fu X-Y, editors, Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025. Association for Computational Linguistics. 2025. p. 988-997. (Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025). doi: 10.18653/v1/2025.naacl-industry.74

Author

Bondarenko, Ivan ; Grebenkin, Daniil ; Sedukhin, Oleg et al. / Pisets: A Robust Speech Recognition System for Lectures and Interviews. Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025. editor / Weizhu Chen ; Yi Yang ; Mohammad Kachuee ; Xue-Yong Fu. Association for Computational Linguistics, 2025. pp. 988-997 (Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025).

BibTeX

@inproceedings{da091e1123cc485fa1c0303daf2b8ea4,

title = "Pisets: A Robust Speech Recognition System for Lectures and Interviews",

abstract = "This work presents a speech-to-text system {"}Pisets{"} for scientists and journalists which is based on a three-component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the utilization of diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcribing of long audio data across various acoustic conditions compared to WhisperX and the usual Whisper model. The source code of {"}Pisets{"} system is publicly available at GitHub: https://github.com/bond005/pisets.",

author = "Ivan Bondarenko and Daniil Grebenkin and Oleg Sedukhin and Mikhail Klementev and Roman Derunets and Lyudmila Budneva",

note = "Ivan Bondarenko, Daniil Grebenkin, Oleg Sedukhin, Mikhail Klementev, Roman Derunets, and Lyudmila Budneva. 2025. Pisets: A Robust Speech Recognition System for Lectures and Interviews. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 988–997, Albuquerque, New Mexico. Association for Computational Linguistics. The work is supported by the grant for the implementation of the strategic academic leadership program {"}Priority 2030{"} at Novosibirsk State University.; 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL ; Conference date: 29-04-2025 Through 04-05-2025",

year = "2025",

doi = "10.18653/v1/2025.naacl-industry.74",

language = "English",

isbn = "9798891761940",

series = "Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025",

publisher = "Association for Computational Linguistics",

pages = "988--997",

editor = "Weizhu Chen and Yi Yang and Mohammad Kachuee and Xue-Yong Fu",

booktitle = "Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025",

address = "United States",

}

RIS

TY - GEN

T1 - Pisets: A Robust Speech Recognition System for Lectures and Interviews

AU - Bondarenko, Ivan

AU - Grebenkin, Daniil

AU - Sedukhin, Oleg

AU - Klementev, Mikhail

AU - Derunets, Roman

AU - Budneva, Lyudmila

N1 - Ivan Bondarenko, Daniil Grebenkin, Oleg Sedukhin, Mikhail Klementev, Roman Derunets, and Lyudmila Budneva. 2025. Pisets: A Robust Speech Recognition System for Lectures and Interviews. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 988–997, Albuquerque, New Mexico. Association for Computational Linguistics. The work is supported by the grant for the implementation of the strategic academic leadership program "Priority 2030" at Novosibirsk State University.

PY - 2025

Y1 - 2025

AB - This work presents a speech-to-text system "Pisets" for scientists and journalists which is based on a three-component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the utilization of diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcribing of long audio data across various acoustic conditions compared to WhisperX and the usual Whisper model. The source code of "Pisets" system is publicly available at GitHub: https://github.com/bond005/pisets.

UR - https://www.scopus.com/pages/publications/105027153282

UR - https://www.mendeley.com/catalogue/74454c3d-06fc-34dc-a3a0-724cb42896d7/

U2 - 10.18653/v1/2025.naacl-industry.74

DO - 10.18653/v1/2025.naacl-industry.74

M3 - Conference contribution

SN - 9798891761940

T3 - Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025

SP - 988

EP - 997

BT - Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025

A2 - Chen, Weizhu

A2 - Yang, Yi

A2 - Kachuee, Mohammad

A2 - Fu, Xue-Yong

PB - Association for Computational Linguistics

T2 - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics

Y2 - 29 April 2025 through 4 May 2025

ER -
