Paper Title
Recipes for Safety in Open-domain Chatbots
Paper Authors
Paper Abstract
Models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior and unwanted biases. We investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. We introduce a new human-and-model-in-the-loop framework for both training safer models and for evaluating them, as well as a novel method to distill safety considerations inside generative models without the use of an external classifier at deployment time. We conduct experiments comparing these methods and find our new techniques are (i) safer than existing models as measured by automatic and human evaluations while (ii) maintaining usability metrics such as engagingness relative to the state of the art. We then discuss the limitations of this work by analyzing failure cases of our models.
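The abstract contrasts the paper's approach with the common baseline of running an external safety classifier at deployment time. As a purely illustrative sketch (not the paper's method), the snippet below shows what such a two-stage "generate, then filter" pipeline looks like; every name here (`generate_reply`, `is_unsafe`, `CANNED_RESPONSE`) is hypothetical, and a toy keyword check stands in for a trained classifier.

```python
# Illustrative sketch of a deployment-time safety filter for a chatbot.
# All components are hypothetical stand-ins, not the paper's actual system.

CANNED_RESPONSE = "Hey, do you want to talk about something else?"

# Toy stand-in for a learned safety classifier's decision boundary.
UNSAFE_KEYWORDS = {"idiot", "hate"}


def generate_reply(user_message: str) -> str:
    # Stand-in for a generative dialogue model's decoded response.
    return f"You said: {user_message}"


def is_unsafe(text: str) -> bool:
    # Stand-in for an external safety classifier run at deployment time.
    lowered = text.lower()
    return any(word in lowered for word in UNSAFE_KEYWORDS)


def safe_reply(user_message: str) -> str:
    # Two-stage pipeline: generate a reply, then filter both the input
    # and the output; fall back to a canned response if either is flagged.
    reply = generate_reply(user_message)
    if is_unsafe(user_message) or is_unsafe(reply):
        return CANNED_RESPONSE
    return reply


print(safe_reply("nice weather today"))  # benign input passes through
print(safe_reply("you are an idiot"))    # flagged input gets the canned response
```

The paper's "baked-in" alternative would instead train the generative model itself to produce the safe fallback, removing the separate `is_unsafe` stage at serving time.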