Paper Title
SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
Paper Authors
Abstract
Data scarcity has been a long-standing issue in the field of open-domain social dialogue. To quench this thirst, we present SODA: the first publicly available, million-scale, high-quality social dialogue dataset. By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions from a large language model. Human evaluation shows that conversations in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets. Using SODA, we train COSMO: a generalizable conversation model that is significantly more natural and consistent on unseen datasets than best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna). Experiments reveal that COSMO is sometimes even preferred to the original human-written gold responses. Additionally, our results shed light on the distinction between knowledge-enriched conversations and natural social chitchat. We plan to make our data, model, and code public.