Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition

Standard

Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition. / Yakovenko, Olga; Bondarenko, Ivan.

Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings. ed. / Wil M. van der Aalst; Vladimir Batagelj; Alexey Buzmakov; Dmitry I. Ignatov; Anna Kalenkova; Michael Khachay; Olessia Koltsova; Andrey Kutuzov; Sergei O. Kuznetsov; Irina A. Lomazova; Natalia Loukachevitch; Ilya Makarov; Amedeo Napoli; Alexander Panchenko; Panos M. Pardalos; Marcello Pelillo; Andrey V. Savchenko; Elena Tutubalina. Springer, 2021. p. 115-126 (Communications in Computer and Information Science; Vol. 1357 CCIS).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Harvard

Yakovenko, O & Bondarenko, I 2021, Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition. in WM van der Aalst, V Batagelj, A Buzmakov, DI Ignatov, A Kalenkova, M Khachay, O Koltsova, A Kutuzov, SO Kuznetsov, IA Lomazova, N Loukachevitch, I Makarov, A Napoli, A Panchenko, PM Pardalos, M Pelillo, AV Savchenko & E Tutubalina (eds), Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings. Communications in Computer and Information Science, vol. 1357 CCIS, Springer, pp. 115-126, 9th International Conference on Analysis of Images, Social Networks, and Texts, Virtual, Online, Russian Federation, 15.10.2020. https://doi.org/10.1007/978-3-030-71214-3_10

APA

Yakovenko, O., & Bondarenko, I. (2021). Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition. In W. M. van der Aalst, V. Batagelj, A. Buzmakov, D. I. Ignatov, A. Kalenkova, M. Khachay, O. Koltsova, A. Kutuzov, S. O. Kuznetsov, I. A. Lomazova, N. Loukachevitch, I. Makarov, A. Napoli, A. Panchenko, P. M. Pardalos, M. Pelillo, A. V. Savchenko, & E. Tutubalina (Eds.), Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings (pp. 115-126). (Communications in Computer and Information Science; Vol. 1357 CCIS). Springer. https://doi.org/10.1007/978-3-030-71214-3_10

Vancouver

Yakovenko O, Bondarenko I. Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition. In van der Aalst WM, Batagelj V, Buzmakov A, Ignatov DI, Kalenkova A, Khachay M, Koltsova O, Kutuzov A, Kuznetsov SO, Lomazova IA, Loukachevitch N, Makarov I, Napoli A, Panchenko A, Pardalos PM, Pelillo M, Savchenko AV, Tutubalina E, editors, Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings. Springer. 2021. p. 115-126. (Communications in Computer and Information Science). doi: 10.1007/978-3-030-71214-3_10

Author

Yakovenko, Olga ; Bondarenko, Ivan. / Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition. Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings. editor / Wil M. van der Aalst ; Vladimir Batagelj ; Alexey Buzmakov ; Dmitry I. Ignatov ; Anna Kalenkova ; Michael Khachay ; Olessia Koltsova ; Andrey Kutuzov ; Sergei O. Kuznetsov ; Irina A. Lomazova ; Natalia Loukachevitch ; Ilya Makarov ; Amedeo Napoli ; Alexander Panchenko ; Panos M. Pardalos ; Marcello Pelillo ; Andrey V. Savchenko ; Elena Tutubalina. Springer, 2021. pp. 115-126 (Communications in Computer and Information Science).

BibTeX

@inproceedings{470a3888063940ea86e516ed9b9cb3ff,

title = "Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition",

abstract = "For many Automatic Speech Recognition (ASR) tasks audio features as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to a complex dimensionality of a feature space. The following paper presents an alternative approach towards generating compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for corpus of spoken commands on the GoogleSpeechCommands dataset. Using the generated features an ASR system was built and compared to the model with MFCC features.",

keywords = "Audio feature representation, Speech recognition, Variational autoencoder",

author = "Olga Yakovenko and Ivan Bondarenko",

note = "Publisher Copyright: {\textcopyright} 2021, Springer Nature Switzerland AG.; 9th International Conference on Analysis of Images, Social Networks, and Texts, AIST 2020 ; Conference date: 15-10-2020 Through 16-10-2020",

year = "2021",

doi = "10.1007/978-3-030-71214-3_10",

language = "English",

isbn = "9783030712136",

series = "Communications in Computer and Information Science",

publisher = "Springer",

pages = "115--126",

editor = "{van der Aalst}, {Wil M.} and Vladimir Batagelj and Alexey Buzmakov and Ignatov, {Dmitry I.} and Anna Kalenkova and Michael Khachay and Olessia Koltsova and Andrey Kutuzov and Kuznetsov, {Sergei O.} and Lomazova, {Irina A.} and Natalia Loukachevitch and Ilya Makarov and Amedeo Napoli and Alexander Panchenko and Pardalos, {Panos M.} and Marcello Pelillo and Savchenko, {Andrey V.} and Elena Tutubalina",

booktitle = "Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings",

address = "United States",

}

RIS

TY - GEN

T1 - Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition

AU - Yakovenko, Olga

AU - Bondarenko, Ivan

N1 - Conference code: 9

PY - 2021

Y1 - 2021

N2 - For many Automatic Speech Recognition (ASR) tasks audio features as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to a complex dimensionality of a feature space. The following paper presents an alternative approach towards generating compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for corpus of spoken commands on the GoogleSpeechCommands dataset. Using the generated features an ASR system was built and compared to the model with MFCC features.

KW - Audio feature representation

KW - Speech recognition

KW - Variational autoencoder

UR - http://www.scopus.com/inward/record.url?scp=85107369094&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/12b4bcd5-1d41-327a-a2a6-1d7e835514b7/

U2 - 10.1007/978-3-030-71214-3_10

DO - 10.1007/978-3-030-71214-3_10

M3 - Conference contribution

AN - SCOPUS:85107369094

SN - 9783030712136

T3 - Communications in Computer and Information Science

SP - 115

EP - 126

BT - Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings

A2 - van der Aalst, Wil M.

A2 - Batagelj, Vladimir

A2 - Buzmakov, Alexey

A2 - Ignatov, Dmitry I.

A2 - Kalenkova, Anna

A2 - Khachay, Michael

A2 - Koltsova, Olessia

A2 - Kutuzov, Andrey

A2 - Kuznetsov, Sergei O.

A2 - Lomazova, Irina A.

A2 - Loukachevitch, Natalia

A2 - Makarov, Ilya

A2 - Napoli, Amedeo

A2 - Panchenko, Alexander

A2 - Pardalos, Panos M.

A2 - Pelillo, Marcello

A2 - Savchenko, Andrey V.

A2 - Tutubalina, Elena

PB - Springer

T2 - 9th International Conference on Analysis of Images, Social Networks, and Texts

Y2 - 15 October 2020 through 16 October 2020

ER -
