CycleDiffusion: Voice Conversion Using Cycle-Consistent Diffusion Models
Dongsuk Yook, Geonhee Han, Hyung-Pil Chang and In-Chul Yoo
Abstract
Voice conversion (VC) refers to the technique of modifying one speaker's voice to mimic another's while retaining the original linguistic content. This technology finds applications in fields such as speech synthesis, accent modification, medicine, security, privacy, and entertainment. Among the various deep generative models used for voice conversion, including variational autoencoders (VAEs) and generative adversarial networks (GANs), diffusion models (DMs) have recently gained attention as promising methods due to their training stability and strong performance in data generation. Nevertheless, traditional DMs focus mainly on learning reconstruction paths like VAEs, rather than conversion paths as GANs do, thereby restricting the quality of the converted speech. To overcome this limitation and enhance voice conversion performance, we propose a cycle-consistent diffusion (CycleDiffusion) model, which comprises two DMs: one for converting the source speaker's voice to the target speaker's voice and the other for converting it back to the source speaker's voice. By employing two DMs and enforcing a cycle consistency loss, the CycleDiffusion model effectively learns both reconstruction and conversion paths, producing high-quality converted speech. The effectiveness of the proposed model in voice conversion is validated through experiments using the VCTK (Voice Cloning Toolkit) dataset.
Full paper is available at https://www.mdpi.com/2076-3417/14/20/9595
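The cycle consistency idea in the abstract can be illustrated with a minimal training-step sketch: one model converts a source utterance toward the target speaker, a second model converts it back, and the round-trip error is penalized. The sketch below is an assumption-laden simplification, not the authors' implementation: `VoiceConverter` is a hypothetical stand-in for a full diffusion model, and the L1 round-trip penalty is only one plausible form of the cycle consistency loss.

```python
import torch
import torch.nn as nn

class VoiceConverter(nn.Module):
    """Hypothetical stand-in for one diffusion model operating on mel
    spectrograms; a real DM would denoise iteratively rather than map
    in a single forward pass."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=3, padding=1),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.net(mel)

src_to_tgt = VoiceConverter()  # DM 1: source speaker -> target speaker
tgt_to_src = VoiceConverter()  # DM 2: target speaker -> source speaker
l1 = nn.L1Loss()

# Dummy batch of source-speaker mel spectrograms: (batch, mels, frames).
mel_src = torch.randn(4, 80, 128)

# Convert to the target speaker, then convert back; penalizing the
# round-trip error forces both the conversion path (src -> tgt) and the
# reconstruction path (tgt -> src) to be learned.
mel_converted = src_to_tgt(mel_src)
mel_cycled = tgt_to_src(mel_converted)
cycle_loss = l1(mel_cycled, mel_src)
cycle_loss.backward()
```

In practice this cycle term would be combined with the usual diffusion training objectives of each model; see the full paper for the exact formulation.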
Samples
Audio samples are taken from the VCTK dataset [1].
A. Audio samples
Some samples are presented in the table below.
| Conversion | Source | Target | DiffVC | CycleDiffusion |
|---|---|---|---|---|
| F1-F2 | (audio) | (audio) | (audio) | (audio) |
| F1-M1 | (audio) | (audio) | (audio) | (audio) |
| F1-M2 | (audio) | (audio) | (audio) | (audio) |
| F2-F1 | (audio) | (audio) | (audio) | (audio) |
| F2-M1 | (audio) | (audio) | (audio) | (audio) |
| F2-M2 | (audio) | (audio) | (audio) | (audio) |
| M1-F1 | (audio) | (audio) | (audio) | (audio) |
| M1-F2 | (audio) | (audio) | (audio) | (audio) |
| M1-M2 | (audio) | (audio) | (audio) | (audio) |
| M2-F1 | (audio) | (audio) | (audio) | (audio) |
| M2-F2 | (audio) | (audio) | (audio) | (audio) |
| M2-M1 | (audio) | (audio) | (audio) | (audio) |
B. Spectrogram samples
Sample spectrograms of utterances converted by DiffVC and CycleDiffusion are shown below. The spectrograms of the utterances processed by CycleDiffusion show more distinct and well-defined formant structures than those generated by DiffVC.
| | DiffVC | CycleDiffusion |
|---|---|---|
| Best case | (spectrogram) | (spectrogram) |
| Worst case | (spectrogram) | (spectrogram) |