CycleDiffusion: Voice Conversion Using Cycle-Consistent Diffusion Models
Dongsuk Yook, Geonhee Han, Hyung-Pil Chang and In-Chul Yoo
Abstract
Voice conversion (VC) refers to the technique of modifying one speaker's voice to mimic another's while retaining the original linguistic content. This technology finds applications in fields such as speech synthesis, accent modification, medicine, security, privacy, and entertainment. Among the various deep generative models used for voice conversion, including variational autoencoders (VAEs) and generative adversarial networks (GANs), diffusion models (DMs) have recently gained attention as promising methods due to their training stability and strong performance in data generation. Nevertheless, traditional DMs focus mainly on learning reconstruction paths like VAEs, rather than conversion paths as GANs do, thereby restricting the quality of the converted speech. To overcome this limitation and enhance voice conversion performance, we propose a cycle-consistent diffusion (CycleDiffusion) model, which comprises two DMs: one for converting the source speaker's voice to the target speaker's voice and the other for converting it back to the source speaker's voice. By employing two DMs and enforcing a cycle consistency loss, the CycleDiffusion model effectively learns both reconstruction and conversion paths, producing high-quality converted speech. The effectiveness of the proposed model in voice conversion is validated through experiments using the VCTK (Voice Cloning Toolkit) dataset.
Full paper is available at https://www.mdpi.com/2076-3417/14/20/9595
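The cycle consistency idea in the abstract can be illustrated with a minimal training-step sketch: one model converts a source utterance toward the target speaker, a second model converts it back, and the round-trip error is penalized. The sketch below is an assumption-laden simplification, not the authors' implementation: `VoiceConverter` is a hypothetical stand-in for a full diffusion model, and the L1 round-trip penalty is only one plausible form of the cycle consistency loss.

```python
import torch
import torch.nn as nn

class VoiceConverter(nn.Module):
    """Hypothetical stand-in for one diffusion model operating on mel
    spectrograms; a real DM would denoise iteratively rather than map
    in a single forward pass."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=3, padding=1),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.net(mel)

src_to_tgt = VoiceConverter()  # DM 1: source speaker -> target speaker
tgt_to_src = VoiceConverter()  # DM 2: target speaker -> source speaker
l1 = nn.L1Loss()

# Dummy batch of source-speaker mel spectrograms: (batch, mels, frames).
mel_src = torch.randn(4, 80, 128)

# Convert to the target speaker, then convert back; penalizing the
# round-trip error forces both the conversion path (src -> tgt) and the
# reconstruction path (tgt -> src) to be learned.
mel_converted = src_to_tgt(mel_src)
mel_cycled = tgt_to_src(mel_converted)
cycle_loss = l1(mel_cycled, mel_src)
cycle_loss.backward()
```

In practice this cycle term would be combined with the usual diffusion training objectives of each model; see the full paper for the exact formulation.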
Samples
Audio samples are taken from the VCTK dataset [1].
A. Audio samples
Some samples are presented in the table below.
| Conversion | Source | Target | DiffVC | CycleDiffusion |
|---|---|---|---|---|
| F1-F2 | (audio) | (audio) | (audio) | (audio) |
| F1-M1 | (audio) | (audio) | (audio) | (audio) |
| F1-M2 | (audio) | (audio) | (audio) | (audio) |
| F2-F1 | (audio) | (audio) | (audio) | (audio) |
| F2-M1 | (audio) | (audio) | (audio) | (audio) |
| F2-M2 | (audio) | (audio) | (audio) | (audio) |
| M1-F1 | (audio) | (audio) | (audio) | (audio) |
| M1-F2 | (audio) | (audio) | (audio) | (audio) |
| M1-M2 | (audio) | (audio) | (audio) | (audio) |
| M2-F1 | (audio) | (audio) | (audio) | (audio) |
| M2-F2 | (audio) | (audio) | (audio) | (audio) |
| M2-M1 | (audio) | (audio) | (audio) | (audio) |
B. Spectrogram samples
Sample spectrograms of utterances converted by DiffVC and CycleDiffusion are shown below. The spectrograms of the utterances processed by CycleDiffusion show more distinct and well-defined formant structures than those generated by DiffVC.
| | DiffVC | CycleDiffusion |
|---|---|---|
| Best case | (spectrogram) | (spectrogram) |
| Worst case | (spectrogram) | (spectrogram) |