Paper title
Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts
Paper authors
Paper abstract
Although several datasets annotated for anaphoric reference/coreference exist, even the largest such datasets have limitations in terms of size, range of domains, coverage of anaphoric phenomena, and size of the documents included. Yet the approaches proposed to scale up anaphoric annotation have not so far resulted in datasets that overcome these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference, partly due to substantial activity by the players and partly thanks to a new resolve-and-aggregate paradigm that 'completes' markable annotations by combining an anaphoric resolver with an aggregation method for anaphoric reference. The proposed method could be adopted to greatly reduce annotation time in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no datasets of comparable size exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents (> 2K in length).
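The resolve-and-aggregate idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the resolver, the aggregation model, or how the two are combined, so everything below is an assumption made for illustration: the function name `resolve_and_aggregate`, the `min_judgments` threshold, the label strings, the choice to add the resolver's prediction as one extra vote for sparsely annotated markables, and the use of plain majority voting as a stand-in for the aggregation method.

```python
from collections import Counter


def resolve_and_aggregate(player_labels, resolver_label, min_judgments=3):
    """Pick one label for a single markable.

    Assumptions (not taken from the paper):
    - the resolver's prediction is added as one extra vote only when the
      markable has fewer than `min_judgments` player judgments;
    - majority voting stands in for the aggregation method.
    """
    votes = list(player_labels)
    # "Complete" a sparsely annotated markable with the automatic judgment.
    if len(votes) < min_judgments and resolver_label is not None:
        votes.append(resolver_label)
    if not votes:
        return None  # nothing to aggregate
    counts = Counter(votes)
    top = max(counts.values())
    # Break ties deterministically by taking the alphabetically first label.
    return sorted(label for label, c in counts.items() if c == top)[0]


# Hypothetical label strings for illustration only.
few_votes = resolve_and_aggregate(
    ["refers-to:m1", "discourse-new"], "refers-to:m1"
)
no_votes = resolve_and_aggregate([], "non-referring")
print(few_votes, no_votes)
```

In this sketch a markable that already has enough player judgments is left to the players alone, while one with few or no judgments still receives a label. A probabilistic aggregation model could replace the majority vote without changing the surrounding logic.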