Paper Title

Audio-visual speech separation based on joint feature representation with cross-modal attention

Paper Authors

Junwen Xiong, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha, Yanning Zhang

Paper Abstract

Multi-modal speech separation has exhibited a distinct advantage in isolating the target speaker in multi-talker noisy environments. Unfortunately, most current separation strategies prefer a straightforward fusion of features learned from each single modality, which falls far short of capturing the inter-relationships between modalities. Inspired by learning joint feature representations from audio and visual streams with an attention mechanism, this study proposes a novel cross-modal fusion strategy that provides the whole framework with semantic correlations between different modalities. To further improve audio-visual speech separation, the dense optical flow of lip motion is incorporated to strengthen the robustness of the visual representation. The proposed work is evaluated on two public audio-visual speech separation benchmark datasets. The overall performance improvement demonstrates that the additional motion network effectively enhances the visual representation of the combined lip images and audio signal, and that the proposed cross-modal fusion outperforms the baseline on all metrics.
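To make the cross-modal fusion idea concrete, below is a minimal PyTorch sketch of a bidirectional cross-modal attention block operating on frame-level audio and visual (lip) embeddings of a shared dimension. All names (`CrossModalFusion`, `dim`, `num_heads`) and the nearest-neighbor temporal alignment are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a common way to realize cross-modal attention
# fusion, NOT the paper's actual architecture.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses audio and visual streams with bidirectional cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio queries attend to visual keys/values, and vice versa,
        # so each stream is re-weighted by its semantic correlation
        # with the other modality.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        # Project the concatenated streams into a joint representation.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_a, dim) frame-level audio embeddings
        # visual: (batch, T_v, dim) frame-level visual (lip) embeddings
        a_attn, _ = self.audio_to_visual(audio, visual, visual)
        v_attn, _ = self.visual_to_audio(visual, audio, audio)
        a = self.norm_a(audio + a_attn)   # residual + norm
        v = self.norm_v(visual + v_attn)
        # Align the visual stream to the audio frame rate before
        # concatenation (a simplifying assumption for this sketch).
        v = nn.functional.interpolate(
            v.transpose(1, 2), size=a.size(1), mode="nearest"
        ).transpose(1, 2)
        return self.proj(torch.cat([a, v], dim=-1))


if __name__ == "__main__":
    fusion = CrossModalFusion(dim=256, num_heads=4)
    # Random placeholders; in the paper the visual embeddings would also
    # draw on dense optical flow of lip motion via the motion network.
    audio = torch.randn(2, 100, 256)   # e.g. 100 spectrogram frames
    visual = torch.randn(2, 25, 256)   # e.g. 25 lip-image frames
    joint = fusion(audio, visual)
    print(joint.shape)                 # torch.Size([2, 100, 256])
```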
