Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

Paper title

Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

Authors

Xiong Xiao, Naoyuki Kanda, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao, Gang Liu, Yu Wu, Jian Wu, Shujie Liu, Jinyu Li, Yifan Gong

Abstract

This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated on the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We first explain our system design, which addresses the issues that arise in handling real multi-talker recordings. We then present the details of the components, which include a Res2Net-based speaker embedding extractor, conformer-based continuous speech separation with leakage filtering, and a modified DOVER (short for Diarization Output Voting Error Reduction) method for system fusion. We evaluate the systems on the data set provided by the VoxSRC Challenge 2020, which contains real-life multi-talker audio collected from YouTube. Our best system achieves diarization error rates (DER) of 3.71% and 6.23% on the development set and evaluation set, respectively, ranking 1st on the diarization track of the challenge.
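For readers unfamiliar with the metric, DER is conventionally defined as the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below illustrates that standard definition only; the function name and the durations are hypothetical and are not taken from the paper, and the challenge's actual scoring details (such as collar or overlap handling) are not reproduced here.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: (missed + false alarm + confusion) / total reference speech.

    All arguments are durations in seconds. This is an illustrative helper,
    not the challenge's official scoring tool.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations, chosen only to make the arithmetic easy to follow.
der = diarization_error_rate(
    missed=3.0, false_alarm=1.0, confusion=1.0, total_speech=100.0
)
print(f"DER = {der:.2%}")
```

With these made-up durations, 5 seconds of error over 100 seconds of reference speech gives a DER of 5%.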
