Title
LegalRelectra: Mixed-domain Language Modeling for Long-range Legal Text Comprehension
Authors
Abstract
The application of Natural Language Processing (NLP) to specialized domains, such as the law, has recently received a surge of interest. As many legal services rely on processing and analyzing large collections of documents, automating such tasks with NLP tools emerges as a key challenge. Many popular language models, such as BERT or RoBERTa, are general-purpose models, which have limitations on processing specialized legal terminology and syntax. In addition, legal documents may contain specialized vocabulary from other domains, such as medical terminology in personal injury text. Here, we propose LegalRelectra, a legal-domain language model that is trained on mixed-domain legal and medical corpora. We show that our model improves over general-domain and single-domain medical and legal language models when processing mixed-domain (personal injury) text. Our training architecture implements the Electra framework, but utilizes Reformer instead of BERT for its generator and discriminator. We show that this improves the model's performance on processing long passages and results in better long-range text comprehension.
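The abstract's training setup follows the ELECTRA recipe: a small generator proposes replacements for a subset of input tokens, and a discriminator is trained to classify every token as original or replaced. The following is a minimal toy sketch of how that replaced-token-detection data is constructed; the token list, vocabulary, and `electra_style_labels` helper are illustrative assumptions, and the actual paper uses Reformer models (not simulated here) as the generator and discriminator.

```python
import random

random.seed(0)

def electra_style_labels(tokens, mask_rate=0.3, vocab=None):
    """Toy ELECTRA-style replaced-token-detection example builder.

    A "generator" replaces a random subset of tokens with other vocabulary
    items; the discriminator's target is 1 where a token was replaced and
    0 where it is original. (Hypothetical helper for illustration only.)
    """
    vocab = vocab or ["court", "injury", "claim", "doctor", "statute"]
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            # Stand-in for the generator: sample a plausible replacement.
            replacement = random.choice([v for v in vocab if v != tok])
            corrupted.append(replacement)
            labels.append(1)  # replaced token -> discriminator should flag it
        else:
            corrupted.append(tok)
            labels.append(0)  # original token
    return corrupted, labels

tokens = ["the", "court", "ruled", "on", "the", "injury", "claim"]
corrupted, labels = electra_style_labels(tokens)
print(corrupted)
print(labels)
```

Because every token position receives a training signal (not just the masked positions, as in BERT-style masked language modeling), this objective is more sample-efficient, which is part of ELECTRA's appeal for training on domain-specific corpora.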