Paper title
Deep Generative Variational Autoencoding for Replay Spoof Detection in Automatic Speaker Verification
Paper authors
Paper abstract
Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount, yet it is difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages powerful unsupervised manifold learning for classification. The potential benefits include the possibility of sampling new data and of gaining insight into the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first one, similar to the use of Gaussian mixture models (GMMs) in spoof detection, is to train two VAEs independently, one for each class. The second one is to train a single conditional model (C-VAE) by injecting a one-hot class label vector into the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement over training two separate VAEs, one for each class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9-10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals, the absolute difference between the original input and its reconstruction, as features for spoofing detection.
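Two of the ideas in the abstract, one-hot class conditioning and VAE residual features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the toy feature values, the label convention (0 = genuine, 1 = spoof), and the placeholder reconstruction are all assumptions made for illustration, and no actual encoder or decoder network is trained here.

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Return an (N, num_classes) one-hot matrix for integer class labels."""
    labels = np.asarray(labels)
    out = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    out[np.arange(labels.shape[0]), labels] = 1.0
    return out

def condition_on_label(features, labels, num_classes=2):
    """Append the one-hot label to each feature vector (hypothetical helper).

    This mimics the conditioning idea of feeding the class label alongside
    the input; the same vector could also be appended to the latent code
    before decoding.
    """
    return np.concatenate([features, one_hot(labels, num_classes)], axis=1)

def vae_residual(original, reconstruction):
    """Element-wise absolute difference between input and reconstruction."""
    return np.abs(original - reconstruction)

# Toy data: 4 feature vectors of dimension 3 (stand-ins for CQCC features).
x = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.0, -0.5],
              [0.0, 0.0, 1.0],
              [-2.0, 1.0, 0.5]], dtype=np.float32)
y = np.array([0, 1, 1, 0])  # assumed convention: 0 = genuine, 1 = spoof

encoder_input = condition_on_label(x, y)

# Placeholder reconstruction; a trained decoder would produce this instead.
x_hat = x * 0.5
residual = vae_residual(x, x_hat)
```

In this sketch `encoder_input` has two extra columns carrying the class label, and `residual` has the same shape as the input, so it could be passed to a downstream classifier as a feature matrix.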