Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review
Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. / Yakovenko, Olga; Bondarenko, Ivan.
Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings. ed. / Carlos Martín-Vide; Miguel A. Vega-Rodríguez; Miin-Shen Yang. Springer Science and Business Media Deutschland GmbH, 2020. p. 54-66 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12494 LNCS).
TY - GEN
T1 - Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems
AU - Yakovenko, Olga
AU - Bondarenko, Ivan
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG. Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2020
Y1 - 2020
N2 - For many Automatic Speech Recognition (ASR) tasks, audio features such as spectrograms yield better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the high dimensionality of the feature space. This paper presents an alternative approach to generating compressed spectrogram representations, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The model trained for a 40-dimensional (300 ms) embedding was then used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.
AB - For many Automatic Speech Recognition (ASR) tasks, audio features such as spectrograms yield better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the high dimensionality of the feature space. This paper presents an alternative approach to generating compressed spectrogram representations, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The model trained for a 40-dimensional (300 ms) embedding was then used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.
KW - Audio feature representation
KW - Speech recognition
KW - Variational Autoencoder
UR - http://www.scopus.com/inward/record.url?scp=85097652298&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-63000-3_5
DO - 10.1007/978-3-030-63000-3_5
M3 - Conference contribution
AN - SCOPUS:85097652298
SN - 9783030629991
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 54
EP - 66
BT - Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings
A2 - Martín-Vide, Carlos
A2 - Vega-Rodríguez, Miguel A.
A2 - Yang, Miin-Shen
PB - Springer Science and Business Media Deutschland GmbH
T2 - 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020
Y2 - 7 December 2020 through 9 December 2020
ER -
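For a concrete picture of the technique the abstract describes, below is a minimal PyTorch sketch of a convolutional VAE that compresses a spectrogram patch into a low-dimensional embedding. The patch shape (80 mel bins × 8 frames), layer widths, and unweighted loss are illustrative assumptions, not the authors' architecture; only the idea of a small latent vector (e.g., 13-dimensional, as in the abstract) used as the ASR feature comes from the record.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    """Convolutional VAE that compresses a spectrogram patch to a small embedding.

    Shapes are hypothetical: a (1, 80, 8) mel-spectrogram patch, two strided
    conv layers, and a 13-dimensional latent space mirroring the abstract.
    """
    def __init__(self, latent_dim=13, freq_bins=80, frames=8):
        super().__init__()
        # Encoder: each strided conv halves both spatial dimensions.
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # -> 16 x 40 x 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # -> 32 x 20 x 2
            nn.ReLU(),
            nn.Flatten(),
        )
        flat = 32 * (freq_bins // 4) * (frames // 4)
        self.fc_mu = nn.Linear(flat, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(flat, latent_dim)  # log-variance of q(z|x)
        # Decoder mirrors the encoder with transposed convolutions.
        self.fc_dec = nn.Linear(latent_dim, flat)
        self.dec = nn.Sequential(
            nn.Unflatten(1, (32, freq_bins // 4, frames // 4)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(self.fc_dec(z)), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the unit-Gaussian prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```

After training (`recon, mu, logvar = ConvVAE()(torch.randn(4, 1, 80, 8))`), the `mu` vector of each patch would serve as its compressed feature, analogous to the 13-dimensional (25 ms) and 40-dimensional (300 ms) embeddings the abstract compares against MFCC features.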