Paper Title
StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation
Paper Authors
Paper Abstract
Diffusion models have shown a great ability to bridge the performance gap between predictive and generative approaches for speech enhancement. We have shown that they may even outperform their predictive counterparts for non-additive corruption types or when evaluated on mismatched conditions. However, diffusion models suffer from a high computational burden, mainly because they require running a neural network for each reverse diffusion step, whereas predictive approaches require only one pass. As diffusion models are generative approaches, they may also produce vocalizing and breathing artifacts in adverse conditions. In comparison, in such difficult scenarios, predictive models typically do not produce such artifacts but tend to distort the target speech instead, thereby degrading the speech quality. In this work, we present a stochastic regeneration approach where an estimate given by a predictive model is provided as a guide for further diffusion. We show that the proposed approach uses the predictive model to remove the vocalizing and breathing artifacts while producing very high quality samples thanks to the diffusion model, even in adverse conditions. We further show that this approach enables the use of lighter sampling schemes with fewer diffusion steps without sacrificing quality, thus reducing the computational burden by an order of magnitude. Source code and audio examples are available online (https://uhh.de/inf-sp-storm).
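One possible reading of "an estimate given by a predictive model is provided as a guide for further diffusion" can be sketched on a toy 1-D signal. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_predictor`, `toy_denoiser`, `regenerate`, the linear noise schedule, and all parameter values are hypothetical stand-ins for the neural networks and sampler described in the paper. The sketch runs the predictor once, perturbs its estimate with Gaussian noise, and then applies a small number of reverse steps.

```python
import math
import random


def toy_predictor(noisy):
    """Hypothetical stand-in for a predictive model: a 3-tap moving average."""
    n = len(noisy)
    out = []
    for i in range(n):
        window = noisy[max(0, i - 1):min(n, i + 2)]
        out.append(sum(window) / len(window))
    return out


def toy_denoiser(x, guide, sigma):
    """Hypothetical stand-in for a learned denoiser.

    Returns the posterior mean of the clean signal under a unit-variance
    Gaussian prior centered on the guide, given x = clean + sigma * noise.
    """
    w = sigma ** 2 / (sigma ** 2 + 1.0)
    return [(1.0 - w) * xi + w * gi for xi, gi in zip(x, guide)]


def regenerate(noisy, num_steps=5, sigma_start=0.5, seed=0):
    """Predict once, re-noise the estimate, then run a short reverse process."""
    if sigma_start <= 0 or num_steps < 1:
        raise ValueError("sigma_start must be > 0 and num_steps >= 1")
    rng = random.Random(seed)

    # Single pass of the predictive model gives the guide.
    guide = toy_predictor(noisy)

    # Stochastic part: start the reverse process from the noised guide.
    x = [g + sigma_start * rng.gauss(0.0, 1.0) for g in guide]

    # Assumed linear noise schedule from sigma_start down to 0.
    sigmas = [sigma_start * (1.0 - k / num_steps) for k in range(num_steps + 1)]

    # Deterministic Euler-style reverse steps: one denoiser call per step.
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = toy_denoiser(x, guide, sigma)
        ratio = sigma_next / sigma
        x = [d + ratio * (xi - d) for xi, d in zip(x, denoised)]
    return x


if __name__ == "__main__":
    noisy_signal = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
    print(regenerate(noisy_signal, num_steps=5))
```

In this sketch the cost is one predictor pass plus `num_steps` denoiser calls, which is the quantity the abstract says can be kept small when the reverse process starts from a predictive estimate.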