Paper Title

FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network

Authors

Zehua Zhang, Lu Zhang, Xuyi Zhuang, Yukun Qian, Heng Li, Mingjiang Wang

Abstract


In recent years, deep learning-based approaches have significantly improved the performance of single-channel speech enhancement. However, due to the limitations of training data and computational complexity, real-time enhancement of full-band (48 kHz) speech signals is still very challenging. Because of the low energy of the spectral information in the high-frequency part, it is difficult to directly model and enhance the full-band spectrum using neural networks. To solve this problem, this paper proposes a two-stage real-time speech enhancement model with an extraction-interpolation mechanism for full-band signals. The 48 kHz full-band time-domain signal is divided into three sub-channels by extraction, and a two-stage processing scheme of 'masking + compensation' is proposed to enhance the signal in the complex domain. After the two-stage enhancement, the enhanced full-band speech signal is restored by interval interpolation. In subjective listening and word-accuracy tests, the proposed model achieves superior performance, outperforming the baseline model by 0.59 MOS overall and by 4.0% WAcc on the non-personalized speech denoising task.
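The extraction-interpolation mechanism above can be sketched in a few lines. The abstract does not specify the exact extraction scheme, so this sketch assumes a simple interleaved every-third-sample split, one common way to turn a 48 kHz signal into three 16 kHz sub-channels; the function names `extract_subchannels` and `interval_interpolate` are hypothetical, not from the paper.

```python
import numpy as np

def extract_subchannels(x: np.ndarray) -> list:
    # Assumed extraction: split the 48 kHz full-band signal into three
    # sub-channels by taking every third sample with offsets 0, 1, 2.
    return [x[k::3] for k in range(3)]

def interval_interpolate(subs: list) -> np.ndarray:
    # Restore the full-band signal by re-interleaving the (enhanced)
    # sub-channel samples back into their original sample positions.
    n = sum(len(s) for s in subs)
    y = np.empty(n, dtype=subs[0].dtype)
    for k, s in enumerate(subs):
        y[k::3] = s
    return y
```

With no enhancement applied in between, extraction followed by interval interpolation reconstructs the input exactly; in the proposed model, the two-stage masking + compensation enhancement would operate on the sub-channels before re-interleaving.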
