High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

Standard

High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram. / Pavlovskiy, Evgeniy; Sheng, Leyuan; Huang, Dong-Yan.

2019.

Research output: Working paper

BibTeX

@techreport{7d2774b08de9462bb05d9b7f6da1b3fa,

title = "High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram",

abstract = "In speech synthesis and speech enhancement systems, mel-spectrograms must be precise acoustic representations. However, generated spectrograms are over-smoothed and therefore cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem with a learning-based post filter that combines Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution. The resulting super-resolution spectrogram networks generate enhanced spectrograms that produce high-quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01 over baseline results of 3.29 and 3.84 when using the Griffin-Lim and WaveNet vocoders, respectively.",

author = "Evgeniy Pavlovskiy and Leyuan Sheng and Dong-Yan Huang",

year = "2019",

month = dec,

day = "3",

language = "English",

type = "WorkingPaper",

}

RIS

TY - UNPB

T1 - High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

AU - Pavlovskiy, Evgeniy

AU - Sheng, Leyuan

AU - Huang, Dong-Yan

PY - 2019/12/3

Y1 - 2019/12/3

N2 - In speech synthesis and speech enhancement systems, mel-spectrograms must be precise acoustic representations. However, generated spectrograms are over-smoothed and therefore cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem with a learning-based post filter that combines Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution. The resulting super-resolution spectrogram networks generate enhanced spectrograms that produce high-quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01 over baseline results of 3.29 and 3.84 when using the Griffin-Lim and WaveNet vocoders, respectively.

AB - In speech synthesis and speech enhancement systems, mel-spectrograms must be precise acoustic representations. However, generated spectrograms are over-smoothed and therefore cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem with a learning-based post filter that combines Pix2PixHD and ResUnet to reconstruct the mel-spectrograms together with super-resolution. The resulting super-resolution spectrogram networks generate enhanced spectrograms that produce high-quality synthesized speech. Our proposed model achieves improved mean opinion scores (MOS) of 3.71 and 4.01 over baseline results of 3.29 and 3.84 when using the Griffin-Lim and WaveNet vocoders, respectively.

M3 - Working paper

BT - High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

ER -

ID: 23059076