Supervised Speaker Embedding De-Mixing in Two-Speaker Environment

Paper Title

Supervised Speaker Embedding De-Mixing in Two-Speaker Environment

Authors

Shi, Yanpei; Hain, Thomas

Abstract

Separating different speaker properties in a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space, as speech source separation does, a speaker embedding de-mixing approach is proposed. The proposed approach separates the different speaker properties of a two-speaker signal in embedding space. It consists of two steps. In step one, clean speaker embeddings are learned and collected by a residual TDNN-based network. In step two, the two-speaker signal and the embedding of one of the speakers are both input to a speaker embedding de-mixing network. The de-mixing network is trained with a reconstruction loss to generate the embedding of the other speaker. Speaker identification accuracy and the cosine similarity score between the clean embeddings and the de-mixed embeddings are used to evaluate the quality of the obtained embeddings. Experiments are conducted on two kinds of data: artificially augmented two-speaker data (TIMIT) and real-world recordings of two-speaker data (MC-WSJ). Six different speaker embedding de-mixing architectures are investigated. Compared with the performance of the clean speaker embeddings, the results show that one of the proposed architectures comes close, reaching 96.9% identification accuracy and 0.89 cosine similarity.
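
The abstract names two quantities that compare a de-mixed embedding with its clean counterpart: a reconstruction loss used for training and a cosine similarity score used for evaluation. The NumPy sketch below illustrates both under stated assumptions. The function names are hypothetical, and the mean-squared-error form of the reconstruction loss is an assumption, since the abstract does not specify the loss. It is not the authors' implementation.

```python
import numpy as np


def cosine_similarity(clean, demixed, eps=1e-12):
    """Row-wise cosine similarity between two (n, d) embedding matrices.

    `clean` holds the step-one embeddings and `demixed` holds the outputs
    of a de-mixing network (both names are hypothetical).
    """
    clean = np.asarray(clean, dtype=float)
    demixed = np.asarray(demixed, dtype=float)
    dot = np.sum(clean * demixed, axis=1)
    norms = np.linalg.norm(clean, axis=1) * np.linalg.norm(demixed, axis=1)
    # eps guards against division by zero for all-zero embeddings.
    return dot / np.maximum(norms, eps)


def reconstruction_loss(clean, demixed):
    """Mean squared error between clean and de-mixed embeddings.

    ASSUMPTION: the abstract says only "reconstruction loss"; MSE is one
    common choice and is used here purely for illustration.
    """
    clean = np.asarray(clean, dtype=float)
    demixed = np.asarray(demixed, dtype=float)
    return float(np.mean((clean - demixed) ** 2))


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, not real data.
    clean = np.array([[1.0, 0.0], [0.0, 2.0]])
    demixed = np.array([[1.0, 0.0], [2.0, 0.0]])
    print(cosine_similarity(clean, demixed))
    print(reconstruction_loss(clean, demixed))
```

Averaging the row-wise scores over a test set would give a single number comparable in kind to the 0.89 reported above, although the exact averaging procedure used in the paper is not stated in the abstract.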
