Paper Title
RuCoLA: Russian Corpus of Linguistic Acceptability
Paper Authors
Paper Abstract
Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of $9.8$k in-domain sentences from linguistic publications and $3.6$k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard (rucola-benchmark.com) to assess the linguistic competence of language models for Russian.
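Since the abstract frames acceptability classifiers as a filter for implausible generated text, the sketch below illustrates how such a binary classifier could be applied to candidate sentences. It is a minimal sketch under stated assumptions, not the paper's actual baseline setup: the checkpoint name "DeepPavlov/rubert-base-cased", the label convention (index 1 = acceptable), and the 0.5 filtering threshold are all illustrative choices, and the classification head would first need to be fine-tuned on RuCoLA for the scores to be meaningful.

```python
# Minimal sketch: scoring sentences with a binary acceptability classifier.
# Assumptions (not from the abstract): a Russian encoder checkpoint is used,
# class index 1 corresponds to "acceptable", and the model has already been
# fine-tuned on RuCoLA labels; out of the box the head is randomly initialized.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "DeepPavlov/rubert-base-cased"  # illustrative Russian encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()


def acceptability_scores(sentences):
    """Return P(acceptable) for each sentence under the classifier."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return torch.softmax(logits, dim=-1)[:, 1]  # probability of the "acceptable" class


# Example: filtering implausible generations by thresholding the score.
candidates = ["Мама мыла раму.", "Рама мыло мамой мыла."]
scores = acceptability_scores(candidates)
kept = [s for s, p in zip(candidates, scores) if p > 0.5]
```

The same scoring function could be reused both for evaluating a model's grammatical knowledge on the in-domain set and for ranking or discarding out-of-domain machine-generated sentences, which is the practical use case the abstract highlights.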