Spatial-DCCRN: DCCRN Equipped with Frame-Level Angle Feature and Hybrid Filtering for Multi-Channel Speech Enhancement

Paper title

Spatial-DCCRN: DCCRN Equipped with Frame-Level Angle Feature and Hybrid Filtering for Multi-Channel Speech Enhancement

Authors

Shubo Lv, Yihui Fu, Yukai Jv, Lei Xie, Weixin Zhu, Wei Rao, Yannan Wang

Abstract

Recently, multi-channel speech enhancement has drawn much interest because spatial information can be used to distinguish target speech from interfering signals. To make full use of spatial information and neural-network-based mask estimation, we propose a multi-channel denoising neural network, Spatial-DCCRN. First, we extend S-DCCRN to the multi-channel scenario, performing a cascaded sub-channel and full-channel processing strategy that can model different channels separately. Moreover, instead of only adopting the multi-channel spectrum or concatenating the first channel's magnitude and inter-channel phase difference (IPD) as the model's inputs, we apply an angle feature extraction (AFE) module to extract frame-level angle feature embeddings, which help the model perceive spatial information explicitly. Finally, since residual noise is more severe when noise and speech exist in the same time-frequency (TF) bin, we design a masking and mapping filtering method to substitute for the traditional filter-and-sum operation, with the purpose of cascading coarse denoising, dereverberation, and residual noise suppression. The proposed model, Spatial-DCCRN, surpasses EaBNet, FasNet, and several competitive models on the L3DAS22 Challenge dataset. Beyond the 3D scenario, Spatial-DCCRN outperforms the state-of-the-art (SOTA) model MIMO-UNet by a large margin in multiple evaluation metrics on the multi-channel ConferencingSpeech2021 Challenge dataset. Ablation studies also demonstrate the effectiveness of the different contributions.
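
For readers unfamiliar with the two conventional components the abstract contrasts against (first-channel magnitude plus IPD as input features, and the filter-and-sum operation), the following is a minimal NumPy sketch. It is not the paper's AFE module or its masking and mapping filter. The function names, the array layout (channels, frames, frequency bins), and the cos/sin encoding of the IPD are assumptions made for illustration only.

```python
import numpy as np


def ipd_features(spec, ref=0):
    """Inter-channel phase difference (IPD) relative to a reference channel.

    spec: complex STFT, shape (channels, frames, freq_bins) -- assumed layout.
    Returns cos/sin of the IPD, shape (2 * (channels - 1), frames, freq_bins).
    """
    phase = np.angle(spec)
    # Phase of every non-reference channel minus the reference phase.
    diff = np.delete(phase, ref, axis=0) - phase[ref]
    # cos/sin encoding avoids the 2*pi wrap-around of the raw difference.
    return np.concatenate([np.cos(diff), np.sin(diff)], axis=0)


def filter_and_sum(spec, weights):
    """Conventional filter-and-sum with one complex weight per channel and TF bin.

    spec, weights: complex arrays, shape (channels, frames, freq_bins).
    Returns the single-channel output, shape (frames, freq_bins).
    """
    return np.sum(np.conj(weights) * spec, axis=0)
```

In a neural beamformer the `weights` would typically be predicted by the network. Here they are simply an input, so the sketch shows only the operation itself.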
