Paper title
Event knowledge in large language models: the gap between the impossible and the unlikely
Paper authors
Abstract
Word co-occurrence patterns in language corpora contain a surprising amount of conceptual knowledge. Large language models (LLMs), trained to predict words in context, leverage these patterns to achieve impressive performance on diverse semantic tasks requiring world knowledge. An important but understudied question about LLMs' semantic abilities is whether they acquire generalized knowledge of common events. Here, we test whether five pre-trained LLMs (from 2018's BERT to 2023's MPT) assign higher likelihood to plausible descriptions of agent-patient interactions than to minimally different implausible versions of the same event. Using three curated sets of minimal sentence pairs (total n=1,215), we found that pre-trained LLMs possess substantial event knowledge, outperforming other distributional language models. In particular, they almost always assign higher likelihood to possible vs. impossible events (The teacher bought the laptop vs. The laptop bought the teacher). However, LLMs show less consistent preferences for likely vs. unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLM scores generalize well across syntactic variants (active vs. passive constructions) but less well across semantic variants (synonymous sentences), (iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence plausibility serves as an organizing dimension in internal LLM representations. Overall, our results show that important aspects of event knowledge naturally emerge from distributional linguistic patterns, but also highlight a gap between representations of possible/impossible and likely/unlikely events.
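The paper's core comparison, whether a language model assigns higher likelihood to the plausible member of a minimal sentence pair, can be illustrated with a toy sketch. The snippet below is not the authors' method: it uses a hypothetical miniature corpus and a simple add-one-smoothed bigram model in place of a pre-trained LLM, purely to show how per-sentence log-likelihoods are computed and compared for a minimal pair.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for large-scale training data.
corpus = (
    "the teacher bought the laptop . "
    "the student bought the laptop . "
    "the teacher read the book . "
    "the nanny tutored the boy . "
).split()

# Bigram model with add-one smoothing over the observed vocabulary.
vocab = set(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def log_prob(sentence: str) -> float:
    """Sum of smoothed bigram log-probabilities for a sentence."""
    tokens = sentence.lower().split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total

# Minimal pair from the abstract: plausible vs. impossible event.
plausible = "the teacher bought the laptop"
implausible = "the laptop bought the teacher"
print(log_prob(plausible) > log_prob(implausible))  # → True
```

In the paper this comparison is run with pre-trained LLMs (BERT through MPT) over 1,215 curated minimal pairs; the bigram model here merely makes the "assign higher likelihood to the plausible version" criterion concrete.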