Paper Title

Learning to Separate Voices by Spatial Regions

Paper Authors

Zhongweiyang Xu, Romit Roy Choudhury

Paper Abstract

We consider the problem of audio voice separation for binaural applications, such as earphones and hearing aids. While today's neural networks perform remarkably well (separating $4+$ sources with 2 microphones), they assume a known or fixed maximum number of sources, $K$. Moreover, today's models are trained in a supervised manner, using training data synthesized from generic sources, environments, and human head shapes. This paper intends to relax both of these constraints at the expense of a slight alteration in the problem definition. We observe that, when a received mixture contains too many sources, it is still helpful to separate them by region, i.e., isolating signal mixtures from each conical sector around the user's head. This requires learning the fine-grained spatial properties of each region, including the signal distortions imposed by a person's head. We propose a two-stage self-supervised framework in which overheard voices from earphones are pre-processed to extract relatively clean personalized signals, which are then used to train a region-wise separation model. Results show promising performance, underscoring the importance of personalization over a generic supervised approach. (Audio samples are available at our project website: https://uiuc-earable-computing.github.io/binaural/.) We believe this result could help real-world applications in selective hearing, noise cancellation, and audio augmented reality.
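The region-wise objective described in the abstract can be made concrete with a short sketch. The Python snippet below is our own illustration, not the paper's code: the sector count, function names, and layout are assumptions. It shows how ground-truth targets for a fixed number of conical sectors can be formed from an arbitrary number of sources, which is why a region-wise model no longer needs to know the source count $K$.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation): the target of
# region-wise separation. Instead of recovering each of K unknown
# sources, the model outputs one signal per fixed conical sector
# around the listener's head.

N_REGIONS = 4  # hypothetical number of conical sectors (90 degrees each)

def region_of(azimuth_deg: float) -> int:
    """Map a source azimuth in [0, 360) degrees to its sector index."""
    return int(azimuth_deg % 360 // (360 / N_REGIONS))

def region_mixtures(sources, azimuths):
    """Ground-truth region-wise targets: sum all sources that fall
    inside the same conical sector. `sources` is a list of 1-D waveforms."""
    targets = [np.zeros_like(sources[0]) for _ in range(N_REGIONS)]
    for s, az in zip(sources, azimuths):
        targets[region_of(az)] += s
    return targets

# Example: 5 sources but only 4 outputs -- the voices at 10 and 40
# degrees share sector 0, so the number of outputs stays fixed no
# matter how many sources the mixture contains.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000) for _ in range(5)]
azimuths = [10, 40, 100, 200, 300]
targets = region_mixtures(sources, azimuths)
assert len(targets) == N_REGIONS
```

In this framing, the network's output dimensionality is tied to the (fixed) number of regions rather than to the (unknown, variable) number of talkers, which is the relaxation the paper argues for.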
