Paper Title
Are Some Words Worth More than Others?
Paper Authors
Paper Abstract
Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of the prediction attempts will occur with frequently-occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull generated text being produced by a downstream consumer of a language model. To address this, we propose two new intrinsic evaluation measures within the framework of a simple word prediction task that are designed to give a more holistic picture of a language model's performance. We evaluate several commonly-used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.
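To illustrate the Zipfian confound the abstract describes, the sketch below computes word-prediction accuracy separately for high- and low-frequency target words instead of pooling all tokens into a single score. This is not the paper's proposed metric; the function name, frequency cutoff, and toy data are illustrative assumptions only.

```python
from collections import Counter


def frequency_stratified_accuracy(predictions, targets, train_counts, cutoff=100):
    """Illustrative sketch: accuracy broken out by target-word frequency band.

    predictions, targets: parallel lists of token strings
    train_counts: Counter of token frequencies in the training corpus
    cutoff: hypothetical threshold separating "frequent" from "rare" types
    """
    hits = {"high": 0, "low": 0}
    totals = {"high": 0, "low": 0}
    for pred, gold in zip(predictions, targets):
        band = "high" if train_counts[gold] >= cutoff else "low"
        totals[band] += 1
        hits[band] += int(pred == gold)
    # Report per-band accuracy; NaN if a band never occurred.
    return {band: hits[band] / totals[band] if totals[band] else float("nan")
            for band in ("high", "low")}


# Toy example: pooled accuracy looks good (3/4), yet the one rare word is missed,
# so the low-frequency band scores 0.0 while the high-frequency band scores 1.0.
train_counts = Counter({"the": 5000, "cat": 800, "sat": 600, "axolotl": 3})
preds = ["the", "cat", "sat", "cat"]
golds = ["the", "cat", "sat", "axolotl"]
print(frequency_stratified_accuracy(preds, golds, train_counts))
# {'high': 1.0, 'low': 0.0}
```

Because frequent types dominate the token stream, a pooled accuracy or perplexity figure can look strong even when rare-word predictions fail, which is the kind of functional difference the proposed metrics aim to surface.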