Paper title
Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi
Paper authors
Paper abstract
Recent advances in Natural Language Processing have been a boon for well-represented languages in terms of available curated data and research resources. One of the challenges for low-resource languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use cases. In this work, we create two datasets focused on news headlines (i.e., short text) for Setswana and Sepedi, and define a news topic classification task on them. We document our work and present baselines for classification. We also investigate a data augmentation approach, better suited to low-resource languages, to improve the performance of the classifiers.