Paper Title
Robust One-Shot Singing Voice Conversion
Paper Authors
Paper Abstract
Recent progress in deep generative models has improved the quality of voice conversion in the speech domain. However, high-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wider variety of musical expressions in pitch, loudness, and pronunciation. Moreover, singing voices are often recorded with reverb and accompaniment music, which makes SVC even more challenging. In this work, we present a robust one-shot SVC (ROSVC) that performs any-to-any SVC robustly even on such distorted singing voices. To this end, we first propose a one-shot SVC model based on generative adversarial networks that generalizes to unseen singers via partial domain conditioning and learns to accurately recover the target pitch via pitch distribution matching and AdaIN-skip conditioning. We then propose a two-stage training method called Robustify, which trains the one-shot SVC model on clean data in the first stage to ensure high-quality conversion, and introduces enhancement modules into the encoders of the model in the second stage to improve feature extraction from distorted singing voices. To further improve the voice quality and pitch reconstruction accuracy, we finally propose a hierarchical diffusion model for singing voice neural vocoders. Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers and significantly improves robustness against distortions.
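The AdaIN-style conditioning mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's actual AdaIN-skip generator; the function name, shapes, and the use of reference-feature statistics as the style source are illustrative assumptions about how adaptive instance normalization conditions content features on a target singer:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization (illustrative sketch).

    Normalizes the content features channel-wise, then re-scales and
    re-shifts them with the channel-wise statistics of the style
    (target-singer) features, so the output carries the content's
    structure with the style's first- and second-order statistics.

    content, style: arrays of shape (channels, time).
    """
    c_mean = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True) + eps
    s_mean = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True) + eps
    # Whiten content per channel, then apply style statistics.
    return s_std * (content - c_mean) / c_std + s_mean
```

In a generator with skip connections ("AdaIN-skip"), such a normalization would typically be applied at multiple resolutions, with the scale/shift parameters predicted from the reference singer's embedding rather than computed directly from raw features as above.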