Paper Title
On the Effectiveness of Automated Metrics for Text Generation Systems
Paper Authors
Paper Abstract
A major challenge in the field of Text Generation is evaluation because we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory on the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.
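To illustrate the kind of question the proposed theory addresses (how many test samples are needed to reliably separate two systems), the sketch below uses a generic two-sample power calculation under a normal approximation. This is not the paper's theory; `delta`, `sigma`, and the BLEU figures in the example are hypothetical placeholders.

```python
# Minimal sketch (assumption: per-sample metric scores are approximately normal):
# estimate how many samples are needed to distinguish two text generation systems
# whose mean automated-metric scores differ by `delta`, with per-sample std `sigma`.
import math
from statistics import NormalDist

def samples_to_distinguish(delta: float, sigma: float,
                           alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-system sample size for a two-sample z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_beta = NormalDist().inv_cdf(power)            # desired statistical power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Hypothetical example: systems differ by 0.5 BLEU on average,
# per-sentence scores vary with sigma ≈ 8.
print(samples_to_distinguish(delta=0.5, sigma=8.0))  # -> 4019 samples per system
```

The sketch only captures sampling noise on the test set; the paper's contribution is to additionally account for other sources of uncertainty, such as the imperfection of the automated metric itself.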