Paper Title

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

Paper Authors

Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir Radev

Paper Abstract

Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale, and an in-depth analysis of human evaluation is lacking. Therefore, we address the shortcomings of existing summarization evaluation along the following axes: (1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units and allows for a high inter-annotator agreement. (2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems on three datasets. (3) We conduct a comparative study of four human evaluation protocols, underscoring potential confounding factors in evaluation setups. (4) We evaluate 50 automatic metrics and their variants using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. The metrics we benchmarked include recent methods based on large language models (LLMs), such as GPTScore and G-Eval. Furthermore, our findings have important implications for evaluating LLMs, as we show that LLMs tuned with human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
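To make the ACU protocol in point (1) concrete, the sketch below shows one way the recall-style aggregation described in the abstract could be computed: annotators judge whether each Atomic Content Unit written from the reference is present in a system summary, and the summary-level score is the fraction of matched ACUs. The `AcuJudgment` data structure and `acu_score` function are illustrative assumptions for this sketch, not the authors' released code.

```python
# Minimal sketch of ACU-based summary scoring (recall over reference ACUs).
# AcuJudgment and acu_score are hypothetical names used only for illustration.

from dataclasses import dataclass
from typing import List


@dataclass
class AcuJudgment:
    acu: str        # one atomic content unit written from the reference summary
    matched: bool   # annotator's binary judgment: is the ACU present in the system summary?


def acu_score(judgments: List[AcuJudgment]) -> float:
    """Fraction of reference ACUs that annotators marked as present."""
    if not judgments:
        return 0.0
    return sum(j.matched for j in judgments) / len(judgments)


if __name__ == "__main__":
    example = [
        AcuJudgment("The bill was passed by the senate.", True),
        AcuJudgment("The vote took place on Tuesday.", False),
        AcuJudgment("The bill concerns healthcare funding.", True),
    ]
    print(f"ACU score: {acu_score(example):.2f}")  # 2 of 3 ACUs matched -> 0.67
```

Because the judgments are binary and tied to fine-grained units rather than holistic impressions, this kind of aggregation is what allows the protocol to reach high inter-annotator agreement.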
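Point (4) concerns statistical stability when comparing automatic metrics against human annotations. A hedged sketch of one such check is given below: estimating a bootstrap confidence interval for the system-level Kendall correlation between a metric and human ACU scores by resampling over documents (a larger benchmark generally yields a tighter interval). The score matrices, variable names, and resampling scheme are assumptions for illustration, not the benchmark's exact analysis code.

```python
# Sketch: bootstrap confidence interval for system-level metric-human correlation.
# The synthetic data below stands in for real per-document, per-system scores.

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Hypothetical score matrices: rows = documents, columns = systems.
n_docs, n_systems = 100, 28
human_scores = rng.random((n_docs, n_systems))
metric_scores = human_scores + 0.3 * rng.standard_normal((n_docs, n_systems))


def system_level_tau(human: np.ndarray, metric: np.ndarray) -> float:
    """Kendall's tau between per-system averages of human and metric scores."""
    tau, _ = kendalltau(human.mean(axis=0), metric.mean(axis=0))
    return tau


# Bootstrap over documents to gauge how stable the correlation estimate is.
taus = []
for _ in range(1000):
    idx = rng.integers(0, n_docs, size=n_docs)
    taus.append(system_level_tau(human_scores[idx], metric_scores[idx]))

low, high = np.percentile(taus, [2.5, 97.5])
print(f"Kendall's tau: {system_level_tau(human_scores, metric_scores):.3f} "
      f"(95% bootstrap CI: [{low:.3f}, {high:.3f}])")
```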
