Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems

Standard

Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. / Yakovenko, Olga; Bondarenko, Ivan.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). ed. / Carlos Martín-Vide; Miguel A. Vega-Rodríguez; Miin-Shen Yang. Springer, 2020. p. 54-66 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12494 LNCS).

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Research › peer-review

Harvard

Yakovenko, O & Bondarenko, I 2020, Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. in C Martín-Vide, MA Vega-Rodríguez & M-S Yang (eds), Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12494 LNCS, Springer, pp. 54-66, 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020, Taoyuan, Taiwan, Province of China, 07.12.2020. https://doi.org/10.1007/978-3-030-63000-3_5

APA

Yakovenko, O., & Bondarenko, I. (2020). Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. In C. Martín-Vide, M. A. Vega-Rodríguez, & M-S. Yang (Eds.), Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (pp. 54-66). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12494 LNCS). Springer. https://doi.org/10.1007/978-3-030-63000-3_5

Vancouver

Yakovenko O, Bondarenko I. Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. In Martín-Vide C, Vega-Rodríguez MA, Yang M-S, editors, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer. 2020. p. 54-66. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-63000-3_5

Author

Yakovenko, Olga ; Bondarenko, Ivan. / Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). editor / Carlos Martín-Vide ; Miguel A. Vega-Rodríguez ; Miin-Shen Yang. Springer, 2020. pp. 54-66 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

BibTeX

@inbook{91df0732ac8b4c88962abc2dea70729d,
  title = "Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems",
  abstract = "For many Automatic Speech Recognition (ASR) tasks audio features as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to a complex dimensionality of a feature space. The following paper presents an alternative approach towards generating compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for corpus of spoken commands on the GoogleSpeechCommands dataset. Using the generated features an ASR system was built and compared to the model with MFCC features.",
  keywords = "Audio feature representation, Speech recognition, Variational Autoencoder",
  author = "Olga Yakovenko and Ivan Bondarenko",
  note = "Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG. Copyright: Copyright 2020 Elsevier B.V., All rights reserved.; 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020 ; Conference date: 07-12-2020 Through 09-12-2020",
  year = "2020",
  doi = "10.1007/978-3-030-63000-3_5",
  language = "English",
  isbn = "9783030629991",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  publisher = "Springer",
  pages = "54--66",
  editor = "Carlos Mart{\'i}n-Vide and Vega-Rodr{\'i}guez, {Miguel A.} and Miin-Shen Yang",
  booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  address = "United States",
}

RIS

TY - CHAP
T1 - Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems
AU - Yakovenko, Olga
AU - Bondarenko, Ivan
PY - 2020
Y1 - 2020
N2 - For many Automatic Speech Recognition (ASR) tasks audio features as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to a complex dimensionality of a feature space. The following paper presents an alternative approach towards generating compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for corpus of spoken commands on the GoogleSpeechCommands dataset. Using the generated features an ASR system was built and compared to the model with MFCC features.
AB - For many Automatic Speech Recognition (ASR) tasks audio features as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to a complex dimensionality of a feature space. The following paper presents an alternative approach towards generating compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for corpus of spoken commands on the GoogleSpeechCommands dataset. Using the generated features an ASR system was built and compared to the model with MFCC features.
KW - Audio feature representation
KW - Speech recognition
KW - Variational Autoencoder
UR - http://www.scopus.com/inward/record.url?scp=85097652298&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/0f8850dc-538b-34bc-bd3a-e7589646dcd6/
U2 - 10.1007/978-3-030-63000-3_5
DO - 10.1007/978-3-030-63000-3_5
M3 - Chapter
AN - SCOPUS:85097652298
SN - 9783030629991
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 54
EP - 66
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
A2 - Martín-Vide, Carlos
A2 - Vega-Rodríguez, Miguel A.
A2 - Yang, Miin-Shen
PB - Springer
T2 - 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020
Y2 - 7 December 2020 through 9 December 2020
ER -