Paper Title
Probing Classifiers are Unreliable for Concept Removal and Detection
Paper Authors
Paper Abstract
Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representations. Removing such concepts is non-trivial because of the complex relationship between the concept, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier, when the concept's relevant features in representation space alone can provide 100% accuracy, we prove that a probing classifier is likely to use non-concept features, and thus post-hoc or adversarial methods will fail to remove the concept correctly. These theoretical implications are confirmed by experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of concept removal, such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.
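To make the failure mode concrete, below is a minimal synthetic sketch. It is not the paper's construction: the two-feature setup, variable names, and the use of an INLP-style null-space projection as a representative post-hoc removal step are all illustrative assumptions. It shows that even when the concept feature alone could give the probe ~100% accuracy, a regularized probe also puts weight on a correlated non-concept feature, so projecting out the probe direction neither fully removes the concept nor leaves the non-concept direction untouched.

```python
# Illustrative sketch (assumed setup, not the paper's exact experiment):
# a probing classifier relies on non-concept features even when the
# concept feature alone is (near-)perfectly predictive.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

c = rng.integers(0, 2, size=n)               # binary concept label
s = 2 * c - 1                                # signed version in {-1, +1}

z_concept = s + 0.1 * rng.normal(size=n)     # concept feature: near-perfectly separable
z_task = 0.8 * s + 1.0 * rng.normal(size=n)  # task feature: merely correlated with c

Z = np.column_stack([z_concept, z_task])

# The probe could reach ~100% accuracy from z_concept alone, yet the
# regularized fit also assigns nonzero weight to the correlated task feature.
probe = LogisticRegression().fit(Z, c)
print("probe accuracy:", probe.score(Z, c))
print("probe weights [concept, task]:", probe.coef_[0])

# Post-hoc removal: project each representation onto the null space of the
# probe direction (INLP-style). The removed direction mixes concept and
# task signal because the probe's weights do.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
Z_erased = Z - np.outer(Z @ w, w)

# The concept remains recoverable well above chance after "removal"...
probe2 = LogisticRegression().fit(Z_erased, c)
print("post-removal probe accuracy:", probe2.score(Z_erased, c))

# ...and iterating the projection until the probe hits chance would, in
# this 2-D example, zero out the whole representation, destroying the
# task-relevant feature along with the concept.
```

Running this sketch, the first probe is essentially perfect while placing visible weight on the task feature, and the second probe still predicts the concept far above 50% accuracy, matching the abstract's claim that such methods fail to remove the concept and, when pushed further, degrade task-relevant features.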