Paper Title


Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages

Authors

Clarissa Forbes, Farhan Samir, Bruce Harold Oliver, Changbing Yang, Edith Coates, Garrett Nicolai, Miikka Silfverberg

Abstract


Recent progress in NLP is driven by pretrained models leveraging massive datasets and has predominantly benefited the world's political and economic superpowers. Technologically underserved languages are left behind because they lack such resources. Hundreds of underserved languages, nevertheless, have available data sources in the form of interlinear glossed text (IGT) from language documentation efforts. IGT remains underutilized in NLP work, perhaps because its annotations are only semi-structured and often language-specific. With this paper, we make the case that IGT data can be leveraged successfully provided that target language expertise is available. We specifically advocate for collaboration with documentary linguists. Our paper provides a roadmap for successful projects utilizing IGT data: (1) It is essential to define which NLP tasks can be accomplished with the given IGT data and how these will benefit the speech community. (2) Great care and target language expertise are required when converting the data into structured formats commonly employed in NLP. (3) Task-specific and user-specific evaluation can help to ascertain that the tools which are created benefit the target language speech community. We illustrate each step through a case study on developing a morphological reinflection system for the Tsimshianic language Gitksan.
