Demo – CycleDiffusion VC

CycleDiffusion: Voice Conversion Using Cycle-Consistent Diffusion Models

Dongsuk Yook, Geonhee Han, Hyung-Pil Chang and In-Chul Yoo

 

Abstract

Voice conversion (VC) refers to the technique of modifying one speaker’s voice to mimic another’s while retaining the original linguistic content. This technology finds applications in fields such as speech synthesis, accent modification, medicine, security, privacy, and entertainment. Among the various deep generative models used for voice conversion, including variational autoencoders (VAEs) and generative adversarial networks (GANs), diffusion models (DMs) have recently gained attention as promising methods due to their training stability and strong performance in data generation. Nevertheless, traditional DMs focus mainly on learning reconstruction paths, like VAEs, rather than conversion paths, as GANs do, thereby restricting the quality of the converted speech. To overcome this limitation and enhance voice conversion performance, we propose a cycle-consistent diffusion (CycleDiffusion) model, which comprises two DMs: one for converting the source speaker’s voice to the target speaker’s voice and the other for converting it back to the source speaker’s voice. By employing two DMs and enforcing a cycle consistency loss, the CycleDiffusion model effectively learns both reconstruction and conversion paths, producing high-quality converted speech. The effectiveness of the proposed model in voice conversion is validated through experiments using the VCTK (Voice Cloning Toolkit) dataset.
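
The following PyTorch sketch illustrates the core idea described above: two converters, one for the source-to-target path and one for the target-to-source path, trained with a cycle-consistency term. This is not the authors' implementation; the Converter class is a toy stand-in for a diffusion-based converter, and the names (src_to_tgt, tgt_to_src, lambda_cyc), the dummy data, and the L1 cycle loss are assumptions for illustration only. Please refer to the paper for the actual objectives and architecture.

import torch
import torch.nn as nn

class Converter(nn.Module):
    """Toy stand-in for a diffusion-based voice converter.

    A real diffusion model would denoise mel-spectrograms conditioned on a
    diffusion timestep and speaker information; a plain MLP is used here so
    that the sketch runs end to end.
    """
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(), nn.Linear(hidden, n_mels)
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.net(mel)

# Two converters (hypothetical names): source -> target and target -> source.
src_to_tgt = Converter()
tgt_to_src = Converter()
opt = torch.optim.Adam(
    list(src_to_tgt.parameters()) + list(tgt_to_src.parameters()), lr=1e-4
)

mel_src = torch.randn(16, 80)   # dummy batch of source-speaker mel frames
lambda_cyc = 10.0               # assumed weight of the cycle-consistency term

converted = src_to_tgt(mel_src)                     # conversion path: source -> target
cycled = tgt_to_src(converted)                      # back-conversion: target -> source
loss_cyc = nn.functional.l1_loss(cycled, mel_src)   # cycle-consistency loss

# A full model would also include the usual diffusion (denoising) training
# losses for each converter; they are omitted here for brevity.
loss = lambda_cyc * loss_cyc
opt.zero_grad()
loss.backward()
opt.step()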

Full paper is available at https://www.mdpi.com/2076-3417/14/20/9595

 

Samples

Audio samples are taken from the VCTK dataset [1].

A. Audio samples

Audio samples for each source–target speaker pair are presented in the table below (F: female speaker, M: male speaker).

Pair     Source    Target    DiffVC    CycleDiffusion
F1-F2    [audio]   [audio]   [audio]   [audio]
F1-M1    [audio]   [audio]   [audio]   [audio]
F1-M2    [audio]   [audio]   [audio]   [audio]
F2-F1    [audio]   [audio]   [audio]   [audio]
F2-M1    [audio]   [audio]   [audio]   [audio]
F2-M2    [audio]   [audio]   [audio]   [audio]
M1-F1    [audio]   [audio]   [audio]   [audio]
M1-F2    [audio]   [audio]   [audio]   [audio]
M1-M2    [audio]   [audio]   [audio]   [audio]
M2-F1    [audio]   [audio]   [audio]   [audio]
M2-F2    [audio]   [audio]   [audio]   [audio]
M2-M1    [audio]   [audio]   [audio]   [audio]

B. Spectrogram samples

Sample spectrograms of utterances converted by DiffVC and CycleDiffusion are shown below. The spectrograms of the utterances converted by CycleDiffusion exhibit more distinct and well-defined formant structures than those produced by DiffVC.

            DiffVC           CycleDiffusion
Best case   [spectrogram]    [spectrogram]
Worst case  [spectrogram]    [spectrogram]

 

References

[1] Veaux, Christophe; Yamagishi, Junichi; MacDonald, Kirsten. (2017). CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit, [sound]. University of Edinburgh. The Centre for Speech Technology Research (CSTR). https://doi.org/10.7488/ds/1994