Standard

Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. / Yakovenko, Olga; Bondarenko, Ivan.

Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings. ed. / Carlos Martín-Vide; Miguel A. Vega-Rodríguez; Miin-Shen Yang. Springer Science and Business Media Deutschland GmbH, 2020. p. 54-66 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12494 LNCS).

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Research; peer-review

Harvard

Yakovenko, O & Bondarenko, I 2020, Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. in C Martín-Vide, MA Vega-Rodríguez & M-S Yang (eds), Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12494 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 54-66, 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020, Taoyuan, Taiwan, Province of China, 07.12.2020. https://doi.org/10.1007/978-3-030-63000-3_5

APA

Yakovenko, O., & Bondarenko, I. (2020). Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. In C. Martín-Vide, M. A. Vega-Rodríguez, & M-S. Yang (Eds.), Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings (pp. 54-66). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12494 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-63000-3_5

Vancouver

Yakovenko O, Bondarenko I. Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. In Martín-Vide C, Vega-Rodríguez MA, Yang M-S, editors, Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings. Springer Science and Business Media Deutschland GmbH. 2020. p. 54-66. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-63000-3_5

Author

Yakovenko, Olga ; Bondarenko, Ivan. / Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings. editor / Carlos Martín-Vide ; Miguel A. Vega-Rodríguez ; Miin-Shen Yang. Springer Science and Business Media Deutschland GmbH, 2020. pp. 54-66 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

BibTeX

@inproceedings{91df0732ac8b4c88962abc2dea70729d,
title = "Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems",
abstract = "For many Automatic Speech Recognition (ASR) tasks, audio features such as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the complex dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to the model with MFCC features.",
keywords = "Audio feature representation, Speech recognition, Variational Autoencoder",
author = "Olga Yakovenko and Ivan Bondarenko",
note = "Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.; 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020 ; Conference date: 07-12-2020 Through 09-12-2020",
year = "2020",
doi = "10.1007/978-3-030-63000-3_5",
language = "English",
isbn = "9783030629991",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
volume = "12494",
publisher = "Springer Science and Business Media Deutschland GmbH",
pages = "54--66",
editor = "Carlos Mart{\'i}n-Vide and {Miguel A.} Vega-Rodr{\'i}guez and Miin-Shen Yang",
booktitle = "Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings",
address = "Germany",

}

RIS

TY - GEN

T1 - Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems

AU - Yakovenko, Olga

AU - Bondarenko, Ivan

N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.

PY - 2020

Y1 - 2020

N2 - For many Automatic Speech Recognition (ASR) tasks, audio features such as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the complex dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to the model with MFCC features.

AB - For many Automatic Speech Recognition (ASR) tasks, audio features such as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the complex dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to the model with MFCC features.

KW - Audio feature representation

KW - Speech recognition

KW - Variational Autoencoder

UR - http://www.scopus.com/inward/record.url?scp=85097652298&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-63000-3_5

DO - 10.1007/978-3-030-63000-3_5

M3 - Conference contribution

AN - SCOPUS:85097652298

SN - 9783030629991

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 54

EP - 66

BT - Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings

A2 - Martín-Vide, Carlos

A2 - Vega-Rodríguez, Miguel A.

A2 - Yang, Miin-Shen

PB - Springer Science and Business Media Deutschland GmbH

T2 - 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020

Y2 - 7 December 2020 through 9 December 2020

ER -

ID: 27546907