Paper Title
Dual-Path Cross-Modal Attention for better Audio-Visual Speech Extraction
Paper Authors
Paper Abstract
Audio-visual target speech extraction, which aims to extract a certain speaker's speech from a noisy mixture by looking at lip movements, has made significant progress by combining time-domain speech separation models with visual feature extractors (CNNs). One problem in fusing audio and video information is that they have different time resolutions. Most current research upsamples the visual features along the time dimension so that audio and video features can be aligned in time. However, we believe that lip movements mostly carry long-term, phone-level information. Based on this assumption, we propose a new way to fuse audio-visual features. We observe that for DPRNN \cite{dprnn}, the time resolution of the inter-chunk dimension can be very close to that of the video frames. As in \cite{sepformer}, the LSTMs in DPRNN are replaced by intra-chunk and inter-chunk self-attention, but in the proposed algorithm the inter-chunk attention incorporates the visual features as an additional feature stream. This avoids upsampling the visual cues and results in more efficient audio-visual fusion. Results show that we achieve superior performance compared with other time-domain audio-visual fusion models.
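
The abstract describes a dual-path block in which intra-chunk self-attention models short-term structure and inter-chunk attention additionally attends over the visual feature stream, whose frame rate is close to the chunk rate. Below is a minimal PyTorch sketch of such a block; the tensor shapes, the concatenation-based key/value fusion, and all hyperparameters are illustrative assumptions rather than the authors' exact implementation.

```python
# A hedged sketch of one dual-path block with cross-modal inter-chunk attention.
# Shapes, fusion strategy, and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class DualPathCrossModalBlock(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Intra-chunk attention: short-term structure within each chunk.
        self.intra_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inter-chunk attention: long-term structure across chunks, here also
        # attending over the (phone-level) visual feature stream.
        self.inter_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_intra = nn.LayerNorm(dim)
        self.norm_inter = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # audio: [B, K, S, D]  (K chunks of length S, feature dim D)
        # video: [B, T_v, D]   (T_v video frames, projected to dim D)
        B, K, S, D = audio.shape

        # Intra-chunk self-attention along the within-chunk axis S.
        x = audio.reshape(B * K, S, D)
        x = self.norm_intra(x + self.intra_attn(x, x, x, need_weights=False)[0])
        x = x.reshape(B, K, S, D)

        # Inter-chunk attention along the chunk axis K. Because the chunk rate
        # is close to the video frame rate, video frames can be appended as
        # extra key/value positions without any upsampling of the visual cues.
        y = x.permute(0, 2, 1, 3).reshape(B * S, K, D)
        v = video.unsqueeze(1).expand(B, S, -1, D).reshape(B * S, -1, D)
        kv = torch.cat([y, v], dim=1)  # joint audio + video stream
        y = self.norm_inter(y + self.inter_attn(y, kv, kv, need_weights=False)[0])
        return y.reshape(B, S, K, D).permute(0, 2, 1, 3)  # back to [B, K, S, D]


# Toy usage: 4 chunks of 16 frames, 4 video frames, feature dim 256.
if __name__ == "__main__":
    block = DualPathCrossModalBlock(dim=256, num_heads=8)
    out = block(torch.randn(2, 4, 16, 256), torch.randn(2, 4, 256))
    print(out.shape)  # torch.Size([2, 4, 16, 256])
```

Concatenating the video frames into the inter-chunk keys and values is one plausible reading of "an additional feature stream"; a separate cross-attention branch would be an equally valid variant under the same assumption.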