Improving Persian Relation Extraction Models by Data Augmentation

Paper title

Improving Persian Relation Extraction Models by Data Augmentation

Authors

Moein Salimi Sartakhti, Romina Etezadi, Mehrnoush Shamsfard

Abstract

Relation extraction, the task of predicting the type of semantic relation between entities in a sentence or document, is an important task in natural language processing. Although there are many studies and datasets for English, Persian lacks sufficient research and comprehensive datasets. The only available Persian dataset for this task is PERLEX, a Persian expert-translated version of the SemEval-2010-Task-8 dataset. In this paper, we present our augmented dataset and the results and findings of our system, which participated in the Persian relation extraction shared task of the NSURL 2021 workshop. We use PERLEX as the base dataset and enhance it by applying text preprocessing steps and by increasing its size via data augmentation techniques, to improve the generalization and robustness of the applied models. We then employ two different models, ParsBERT and multilingual BERT, for relation extraction on the augmented PERLEX dataset. Our best model obtained a Macro-F1 of 64.67% in the test phase of the contest and 83.68% on the PERLEX test set.
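
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a plain Macro-F1 score can be computed: the unweighted mean of per-class F1. The function name `macro_f1` and the toy labels are illustrative assumptions, not the authors' code. The official SemEval-2010 Task 8 scorer applies additional conventions (for example in how it treats the "Other" class and relation direction), so this sketch is not expected to reproduce the numbers in the abstract.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred.

    Illustrative helper (hypothetical name); not the shared task's scorer.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # predicted p, but it was wrong
            fn[g] += 1          # missed an instance of class g
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up predictions over three relation labels.
gold = ["Cause-Effect", "Cause-Effect", "Component-Whole", "Other"]
pred = ["Cause-Effect", "Other", "Component-Whole", "Other"]
score = macro_f1(gold, pred)
print(f"Macro-F1: {score:.4f}")
```

Because every class contributes equally to the average, rare relation types weigh as much as frequent ones, which makes Macro-F1 a common choice for relation extraction benchmarks with uneven class sizes.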
