Paper Title
NukeBERT: A Pre-trained language model for Low Resource Nuclear Domain
Paper Authors
Paper Abstract
Significant advances have been made in recent years in Natural Language Processing, with machines surpassing human performance on many tasks, including but not limited to Question Answering. The majority of deep learning methods for Question Answering target domains with large datasets and highly mature literature. The area of nuclear and atomic energy has largely remained unexplored in exploiting non-annotated data to drive industry-viable applications. Due to the lack of a dataset, a new dataset was created from 7,000 research papers on the nuclear domain. This paper contributes to research in understanding nuclear domain knowledge, which is then evaluated on the Nuclear Question Answering Dataset (NQuAD), created by nuclear domain experts as part of this research. NQuAD contains 612 questions developed on 181 paragraphs randomly selected from the IGCAR research paper corpus. In this paper, Nuclear Bidirectional Encoder Representational Transformers (NukeBERT) is proposed, which incorporates a novel technique for building the BERT vocabulary to make it suitable for tasks with less training data. Experiments evaluated on NQuAD revealed that NukeBERT was able to outperform BERT significantly, thus validating the adopted methodology. Training NukeBERT is computationally expensive, and hence we will be open-sourcing the NukeBERT pretrained weights and NQuAD to foster further research in the nuclear domain.
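The abstract does not spell out the vocabulary-building technique itself, so the following is only a minimal illustrative sketch, assuming the standard HuggingFace Transformers API, of the general idea of extending a general-domain BERT vocabulary with domain-specific tokens before further pretraining on a domain corpus. The nuclear-related terms listed are hypothetical examples, not taken from the paper.

```python
# Illustrative sketch (not the paper's exact method): extend a general-domain
# BERT vocabulary with domain-specific terms so they are no longer fragmented
# into many sub-word pieces, then resize the embedding matrix accordingly.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical domain terms (for illustration only).
domain_terms = ["radionuclide", "actinide", "sodium-cooled", "breeder-reactor"]

num_added = tokenizer.add_tokens(domain_terms)   # extend the tokenizer vocabulary
model.resize_token_embeddings(len(tokenizer))    # add embedding rows for the new tokens

print(f"Added {num_added} domain tokens; vocabulary size is now {len(tokenizer)}.")

# The extended model would then be further pretrained with masked language
# modelling on the domain corpus so the new token embeddings acquire
# meaningful representations before fine-tuning on a QA dataset such as NQuAD.
```

The point of such vocabulary extension is that a general-purpose tokenizer splits rare domain terms into many sub-word pieces, which dilutes their representations when only a small domain corpus is available for continued pretraining.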