Paper Title

A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS

Paper Authors

Haohan Guo, Hui Lu, Xixin Wu, Helen Meng

Paper Abstract

The generative adversarial network (GAN) has shown outstanding capability in improving non-autoregressive TTS (NAR-TTS) by adversarially training it with an extra model that discriminates between real and generated speech. To maximize the benefits of GAN training, it is crucial to find a powerful discriminator that can capture rich distinguishable information. In this paper, we propose a multi-scale time-frequency spectrogram discriminator to help NAR-TTS generate high-fidelity Mel-spectrograms. It treats the spectrogram as a 2D image to exploit the correlation among different components in the time-frequency domain, and employs a U-Net-based model structure to discriminate at different scales, capturing both coarse-grained and fine-grained information. We conduct subjective tests to evaluate the proposed approach. Both multi-scale and time-frequency discrimination bring significant improvements in naturalness and fidelity. When combined with a neural vocoder, the approach proves more effective and concise than fine-tuning the vocoder. Finally, we visualize the discriminating maps and compare their differences to verify the effectiveness of multi-scale discrimination.
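
To make the described architecture concrete, below is a minimal PyTorch sketch of a U-Net-style spectrogram discriminator that treats the mel-spectrogram as a 2D image and outputs both a coarse-grained score map (from the encoder bottleneck) and a fine-grained, per-time-frequency-bin score map (from the decoder). The class name, channel widths, kernel sizes, and number of levels are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed hyperparameters, not the paper's exact setup) of a
# U-Net-style spectrogram discriminator: the mel-spectrogram is treated as a
# 2D image; the encoder bottleneck yields a coarse-grained score map and the
# decoder yields a fine-grained, per-time-frequency-bin score map.
import torch
import torch.nn as nn


class UNetSpectrogramDiscriminator(nn.Module):
    def __init__(self, base_channels: int = 32):
        super().__init__()
        c = base_channels
        # Encoder: strided 2D convolutions downsample along time and frequency.
        self.enc1 = nn.Sequential(nn.Conv2d(1, c, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(c, 2 * c, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv2d(2 * c, 4 * c, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
        # Coarse-grained discriminating map from the bottleneck.
        self.coarse_head = nn.Conv2d(4 * c, 1, 3, padding=1)
        # Decoder: transposed convolutions upsample back, with skip connections.
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(4 * c, 2 * c, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(4 * c, c, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.dec1 = nn.ConvTranspose2d(2 * c, 1, 4, stride=2, padding=1)

    def forward(self, mel: torch.Tensor):
        # mel: (batch, n_mels, n_frames); both dims assumed divisible by 8 here
        # so the skip connections line up. Add a channel axis to treat it as an image.
        x = mel.unsqueeze(1)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        coarse = self.coarse_head(e3)                  # low-resolution score map
        d3 = self.dec3(e3)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))     # skip connection
        fine = self.dec1(torch.cat([d2, e1], dim=1))   # per-bin score map
        return coarse, fine


if __name__ == "__main__":
    disc = UNetSpectrogramDiscriminator()
    fake_mel = torch.randn(2, 80, 128)  # 80 mel bins, 128 frames
    coarse, fine = disc(fake_mel)
    print(coarse.shape, fine.shape)     # torch.Size([2, 1, 10, 16]) torch.Size([2, 1, 80, 128])
```

In adversarial training, both score maps computed on real and generated mel-spectrograms would feed a standard GAN objective (e.g. a least-squares loss), so the NAR-TTS generator receives coarse-grained and fine-grained feedback simultaneously.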
