Paper Title

Adversarially Robust Classification by Conditional Generative Model Inversion

Authors

Mitra Alirezaei, Tolga Tasdizen

Abstract

Most adversarial attack defense methods rely on obfuscating gradients. These methods are successful in defending against gradient-based attacks; however, they are easily circumvented by attacks which either do not use the gradient or by attacks which approximate and use the corrected gradient. Defenses that do not obfuscate gradients such as adversarial training exist, but these approaches generally make assumptions about the attack such as its magnitude. We propose a classification model that does not obfuscate gradients and is robust by construction without assuming prior knowledge about the attack. Our method casts classification as an optimization problem where we "invert" a conditional generator trained on unperturbed, natural images to find the class that generates the closest sample to the query image. We hypothesize that a potential source of brittleness against adversarial attacks is the high-to-low-dimensional nature of feed-forward classifiers which allows an adversary to find small perturbations in the input space that lead to large changes in the output space. On the other hand, a generative model is typically a low-to-high-dimensional mapping. While the method is related to Defense-GAN, the use of a conditional generative model and inversion in our model instead of the feed-forward classifier is a critical difference. Unlike Defense-GAN, which was shown to generate obfuscated gradients that are easily circumvented, we show that our method does not obfuscate gradients. We demonstrate that our model is extremely robust against black-box attacks and has improved robustness against white-box attacks compared to naturally trained, feed-forward classifiers.
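
The inversion procedure described in the abstract can be summarized in a short sketch. The code below is our own illustration, not the authors' released implementation: it assumes a pretrained conditional generator `G(z, y)` that maps a latent code z and a class label y to an image, and the latent dimension, optimizer, learning rate, and step counts are placeholder choices. For each candidate class it optimizes z to minimize the reconstruction error between G(z, y) and the query image, then predicts the class with the lowest error.

```python
# Minimal sketch (assumed interface, not the authors' code) of classification
# by conditional generative model inversion.
import torch

def classify_by_inversion(G, x, num_classes, latent_dim=100,
                          steps=200, lr=0.05, restarts=1):
    """Return the class whose conditional generation best reconstructs x."""
    best_class, best_err = None, float("inf")
    for y in range(num_classes):
        for _ in range(restarts):
            # Optimize the latent code so that the image generated for
            # class y is as close as possible to the query image x.
            z = torch.randn(1, latent_dim, requires_grad=True)
            opt = torch.optim.Adam([z], lr=lr)
            for _ in range(steps):
                opt.zero_grad()
                # G(z, y) is an assumed conditional-generator interface.
                err = torch.mean((G(z, torch.tensor([y])) - x) ** 2)
                err.backward()
                opt.step()
            if err.item() < best_err:
                best_err, best_class = err.item(), y
    return best_class
```

As the abstract notes, the prediction here is the outcome of an optimization over the generator's low-to-high-dimensional mapping rather than a single feed-forward pass, which is the structural difference the paper credits for improved robustness without obfuscated gradients.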
