Paper title
Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios
Paper authors
Paper abstract
Overlapping speech diarization has traditionally been treated as a multi-label classification problem. In this paper, we reformulate the task as a single-label prediction problem by encoding multiple binary labels into a single label with the power set, which represents the possible combinations of target speakers. This formulation has two benefits. First, the overlaps of target speakers are explicitly modeled. Second, threshold selection is no longer needed. Based on this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, in which a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings. Experimental results show that SEND has a stable learning process and can be trained on highly overlapped data without extra initialization. More importantly, our method achieves state-of-the-art performance in real meeting scenarios with fewer model parameters and lower computational complexity.
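The power-set encoding described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact mapping from speaker combinations to class indices, so the binary-weighted index below is an assumption, and the function names `powerset_encode`, `powerset_decode`, and `predict_frame` are hypothetical. The sketch enumerates the full power set of N speakers (2^N classes); the actual system may restrict the set of combinations.

```python
from typing import List


def powerset_encode(active: List[int]) -> int:
    """Map a per-speaker binary activity vector to one class index.

    Assumed mapping: speaker i contributes 2**i, so every combination
    of active speakers gets a distinct integer in [0, 2**N - 1].
    """
    return sum(bit << i for i, bit in enumerate(active))


def powerset_decode(label: int, num_speakers: int) -> List[int]:
    """Recover the per-speaker binary activity vector from a class index."""
    return [(label >> i) & 1 for i in range(num_speakers)]


def predict_frame(class_scores: List[float], num_speakers: int) -> List[int]:
    """Pick the highest-scoring class and decode it.

    Because exactly one class is chosen, no per-speaker decision
    threshold has to be tuned.
    """
    best = max(range(len(class_scores)), key=class_scores.__getitem__)
    return powerset_decode(best, num_speakers)


# One frame with three target speakers: speakers 0 and 2 overlap.
frame = [1, 0, 1]
label = powerset_encode(frame)           # 1*1 + 0*2 + 1*4 = 5
restored = powerset_decode(label, 3)     # back to [1, 0, 1]

# Hypothetical scores over the 2**3 = 8 classes; class 5 scores highest.
scores = [0.1, 0.0, 0.2, 0.1, 0.0, 0.9, 0.3, 0.1]
decision = predict_frame(scores, 3)
```

In this sketch the overlap of speakers 0 and 2 is its own class (index 5) rather than two independent binary decisions, which is the sense in which overlaps are modeled explicitly.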