Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS

Paper Title

Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS

Authors

Kenta Udagawa, Yuki Saito, Hiroshi Saruwatari

Abstract

This paper proposes a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech. With a conventional speaker-adaptation method, a target speaker's embedding vector is extracted from his/her reference speech using a speaker encoder trained on a speaker-discriminative task. However, this method cannot obtain an embedding vector for the target speaker when the reference speech is unavailable. Our method is based on a human-in-the-loop optimization framework, which incorporates a user to explore the speaker-embedding space to find the target speaker's embedding. The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space. To efficiently choose the best speech sample from multiple stimuli, we also developed a system in which a user can switch between multiple speakers' voices for each phoneme while looping an utterance. Experimental results indicate that the proposed method can achieve comparable performance to the conventional one in objective and subjective evaluations even if reference speech is not used as the input of a speaker encoder directly.
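The line-search loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`simulated_user`, `line_candidates`, `line_search_adaptation`) are hypothetical, the human listener is replaced by a simulated user who picks the candidate closest to a hidden target embedding, and each line segment is drawn in a random direction through the current best point, whereas the method in the paper may choose segments differently. No TTS model or speaker encoder is involved; the vectors merely stand in for speaker embeddings.

```python
import math
import random


def simulated_user(candidates, target):
    """Stand-in for a human listener (assumption): choose the candidate
    embedding closest to a hidden target embedding."""
    return min(candidates, key=lambda c: math.dist(c, target))


def line_candidates(start, end, n_points):
    """Return n_points evenly spaced points on the segment from start to end."""
    return [
        [s + (e - s) * i / (n_points - 1) for s, e in zip(start, end)]
        for i in range(n_points)
    ]


def line_search_adaptation(target, n_rounds=30, n_points=9, radius=1.0, seed=0):
    """Repeatedly ask the (simulated) user to pick a point on a line segment.

    Simplification: each segment passes through the current best point in a
    random direction. All parameter values here are illustrative.
    """
    rng = random.Random(seed)
    dim = len(target)
    best = [0.0] * dim  # start the search from the origin
    for _ in range(n_rounds):
        # Draw a random unit direction for this round's segment.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction)) or 1.0
        start = [b - radius * d / norm for b, d in zip(best, direction)]
        end = [b + radius * d / norm for b, d in zip(best, direction)]
        # The user selects one point on the segment; it becomes the new best.
        best = simulated_user(line_candidates(start, end, n_points), target)
    return best
```

In a real system, each candidate point would be synthesized into speech and played to the user, whose selection replaces the distance-based choice made by `simulated_user` here.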
