Paper Title
Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings
Paper Authors
Paper Abstract
AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained by utilizing multi-modal self-supervised embeddings. Nevertheless, it remains unclear whether such representations can be generalized to solve real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
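The abstract describes an architecture in which a pre-trained AV-HuBERT backbone produces fused audio-visual embeddings that are then fed to an SE module for enhancement or separation. The following is a minimal sketch of that idea, not the authors' actual implementation: the backbone below is a hypothetical stand-in for the real AV-HuBERT encoder (which would be loaded from released checkpoints), and all feature dimensions, module names, and the mask-based SE head are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AVHuBERTBackbone(nn.Module):
    """Hypothetical stand-in for a pre-trained AV-HuBERT encoder.

    Maps per-frame (audio, video) features to a sequence of fused
    audio-visual embeddings. Dimensions here are assumptions, not the
    real AV-HuBERT configuration.
    """

    def __init__(self, audio_dim=104, video_dim=512, embed_dim=768):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, audio_feats, video_feats):
        # Fuse the two modalities and contextualize the result.
        fused = self.audio_proj(audio_feats) + self.video_proj(video_feats)
        return self.encoder(fused)  # (batch, frames, embed_dim)


class SEHead(nn.Module):
    """Illustrative SE module: predicts a time-frequency mask from embeddings."""

    def __init__(self, embed_dim=768, n_freq=257):
        super().__init__()
        self.rnn = nn.LSTM(embed_dim, 256, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(512, n_freq), nn.Sigmoid())

    def forward(self, embeddings, noisy_mag):
        h, _ = self.rnn(embeddings)
        return self.mask(h) * noisy_mag  # masked magnitude spectrogram


# Usage sketch with a simple fine-tuning strategy: freeze the backbone and
# train only the SE head first; the backbone can be unfrozen in a later stage.
backbone, se_head = AVHuBERTBackbone(), SEHead()
for p in backbone.parameters():
    p.requires_grad = False

audio = torch.randn(2, 100, 104)      # (batch, frames, audio feature dim)
video = torch.randn(2, 100, 512)      # (batch, frames, visual feature dim)
noisy_mag = torch.rand(2, 100, 257)   # noisy magnitude spectrogram
enhanced_mag = se_head(backbone(audio, video), noisy_mag)
print(enhanced_mag.shape)             # torch.Size([2, 100, 257])
```

The staged freezing above reflects only the abstract's mention of "proper fine-tuning strategies"; the exact schedule used in the paper is not specified here.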