Paper Title

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

Paper Authors

Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A. Choquette-Choo, Nicholas Carlini

Paper Abstract

Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. We argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization. Specifically, we design and implement an efficient defense that perfectly prevents all verbatim memorization. And yet, we demonstrate that this "perfect" filter does not prevent the leakage of training data. Indeed, it is easily circumvented by plausible and minimally modified "style-transfer" prompts -- and in some cases even the non-modified original prompts -- to extract memorized information. We conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models.
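
To make the kind of defense described in the abstract concrete, the sketch below blocks a generation whenever its most recent n tokens exactly match an n-gram from the training set. The n-gram length, token-level matching, and the plain in-memory set (rather than a space-efficient structure) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a verbatim-memorization filter (illustrative assumptions:
# token-level matching, fixed n-gram length, plain in-memory set).
# Not the paper's exact implementation.

from typing import List, Set, Tuple


def build_ngram_index(training_corpus: List[List[int]], n: int = 10) -> Set[Tuple[int, ...]]:
    """Collect every length-n token window that occurs in the training data."""
    index: Set[Tuple[int, ...]] = set()
    for doc in training_corpus:
        for i in range(len(doc) - n + 1):
            index.add(tuple(doc[i:i + n]))
    return index


def violates_filter(generated: List[int], index: Set[Tuple[int, ...]], n: int = 10) -> bool:
    """Return True if the newest length-n window of the generation reproduces a training n-gram."""
    if len(generated) < n:
        return False
    return tuple(generated[-n:]) in index


# Usage: during sampling, reject (or resample) a candidate token whenever
# appending it would make the last n tokens match a training n-gram verbatim.
corpus = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]
index = build_ngram_index(corpus, n=5)
print(violates_filter([0, 3, 4, 5, 6, 7], index, n=5))   # True: (3, 4, 5, 6, 7) occurs in the corpus
print(violates_filter([0, 3, 4, 5, 6, 8], index, n=5))   # False: no exact training match
```

Applied at every decoding step, such a check guarantees that no generation contains a training substring longer than n tokens; this is the kind of "perfect" verbatim filter whose limitations the paper demonstrates, since paraphrased or style-transferred leakage passes through untouched.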
