Paper Title
Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using Permutation-Free Loss Function
Paper Authors
Paper Abstract
In this paper, we propose deep learning based multi-speaker direction of arrival (DOA) estimation with audio and visual signals, using a permutation-free loss function. We first collect a dataset for multi-modal sound source localization (SSL) in which both audio and visual signals are recorded in real-life home TV scenarios. We then propose a novel spatial annotation method that produces the DOA ground truth for each speaker from the video data, via the transformation between camera coordinates and pixel coordinates under the pinhole camera model. With spatial location information serving as an additional input alongside acoustic features, multi-speaker DOA estimation can be formulated as a classification task of active speaker detection. Because each speaker's location is used as an input, the label permutation problem common to multi-speaker tasks is avoided. Experiments conducted on both simulated and real data show that the proposed audio-visual DOA estimation model outperforms the audio-only DOA estimation model by a large margin.
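The annotation method described above maps each speaker's pixel position to a DOA via the pinhole camera model. A minimal sketch of that back-projection is given below; the function name and the intrinsic parameters (focal lengths `fx`, `fy` and principal point `cx`, `cy`) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def pixel_to_doa(u, v, fx, fy, cx, cy):
    """Back-project a pixel (u, v) to a ray in camera coordinates under
    the pinhole model, then express that ray as azimuth/elevation angles
    (degrees), which serve as the DOA ground truth for the speaker.

    fx, fy: focal lengths in pixels; cx, cy: principal point (assumed known
    from camera calibration).
    """
    # Normalized camera coordinates of the viewing ray: (x, y, 1)
    x = (u - cx) / fx
    y = (v - cy) / fy
    # Azimuth: horizontal angle of the ray; elevation: vertical angle
    # (image v grows downward, hence the sign flip on y).
    azimuth = np.degrees(np.arctan2(x, 1.0))
    elevation = np.degrees(np.arctan2(-y, np.hypot(x, 1.0)))
    return azimuth, elevation

# A pixel at the principal point lies on the optical axis: DOA (0, 0).
print(pixel_to_doa(320.0, 240.0, 500.0, 500.0, 320.0, 240.0))
```

In a full pipeline, the camera coordinate frame would additionally need to be aligned with the microphone array frame (an extrinsic rotation/translation) before these angles can supervise the acoustic model.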