SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis

Teysir Baoueb, Haocheng Liu, Mathieu Fontaine, Jonathan Le Roux, Gaël Richard

[Code | Paper]

Abstract: Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues, including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrograms. In our model, training stability is enhanced by means of a forward diffusion process that injects noise from a Gaussian distribution into both real and fake samples before they are input to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution, with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably to several baselines in terms of audio quality and efficiency.
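To give a concrete picture of the noise-injection step described in the abstract, here is a minimal PyTorch sketch. It is an illustration only, not the paper's implementation: the DDPM-style variance schedule, the STFT-domain shaping in `spectrally_shaped_noise` (white noise recolored with a reference signal's magnitude envelope), and all function names and parameters are assumptions made for this example.

```python
import torch

def spectrally_shaped_noise(x, n_fft=1024, hop=256):
    """Hypothetical spectral shaping: Gaussian white noise is recolored in the
    STFT domain so its magnitude envelope follows that of the reference x
    (shape: [batch, samples]). The exact shaping in SpecDiff-GAN may differ."""
    window = torch.hann_window(n_fft, device=x.device)
    X = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    noise = torch.randn_like(x)
    N = torch.stft(noise, n_fft, hop_length=hop, window=window, return_complex=True)
    # Keep the noise's phase, impose the reference magnitude envelope.
    shaped = N / (N.abs() + 1e-8) * X.abs()
    out = torch.istft(shaped, n_fft, hop_length=hop, window=window, length=x.shape[-1])
    # Renormalize to roughly unit variance per example.
    return out / (out.std(dim=-1, keepdim=True) + 1e-8)

def diffuse(x, t, alphas_bar, noise):
    """Standard forward-diffusion mix at step t:
    sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * noise."""
    a = alphas_bar[t].view(-1, 1)
    return a.sqrt() * x + (1.0 - a).sqrt() * noise

# Usage: inject shaped noise into both real and fake batches before the
# discriminator sees them, with a timestep sampled per example.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

real = torch.randn(4, 16000)                     # placeholder waveforms
fake = torch.randn(4, 16000)                     # generator output would go here
t = torch.randint(0, T, (4,))
real_noisy = diffuse(real, t, alphas_bar, spectrally_shaped_noise(real))
# Illustrative choice: shape the fake batch's noise from the real references.
fake_noisy = diffuse(fake, t, alphas_bar, spectrally_shaped_noise(real))
```

In this sketch, deriving the noise spectrum from the real batch is one possible design choice; consult the paper for the exact noise distribution and schedule used in SpecDiff-GAN.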

Contents

LJSpeech
VCTK
MAPS
ENST-Drums
References

LJSpeech


[Audio samples: Ground-Truth, HiFi-GAN, UnivNet (lr=1e-4), UnivNet (lr=2e-4), StandardDiff-GAN, SpecDiff-GAN, BigVGAN (lr=1e-4), BigVGAN (lr=2e-4)]

VCTK


[Audio samples: Ground-Truth, HiFi-GAN, UnivNet (lr=1e-4), UnivNet (lr=2e-4), StandardDiff-GAN, SpecDiff-GAN, BigVGAN (lr=1e-4), BigVGAN (lr=2e-4)]

MAPS


[Audio samples: Ground-Truth, HiFi-GAN, StandardDiff-GAN, SpecDiff-GAN, BigVGAN]

ENST-Drums


[Audio samples: Ground-Truth, HiFi-GAN, StandardDiff-GAN, SpecDiff-GAN, BigVGAN]

References

[1] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020.
[2] S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” in Proc. ICLR, 2023.
[3] W. Jang, D. C. Y. Lim, J. Yoon, B. Kim, and J. Kim, “UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation,” in Proc. Interspeech, 2021.
[4] K. Ito and L. Johnson, “The LJ speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
[5] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), Tech. Rep., 2019.
[6] V. Emiya, N. Bertin, B. David, and R. Badeau, “MAPS: A piano database for multipitch estimation and automatic transcription of music,” INRIA, Tech. Rep., Jul. 2010.
[7] O. Gillet and G. Richard, “ENST-Drums: An extensive audio-visual database for drum signals processing,” in Proc. ISMIR, 2006.