Paper Title
Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation
Paper Authors
Paper Abstract
Language modelling and machine translation tasks mostly use subword or character inputs, but syllables are seldom used. Syllables provide shorter sequences than characters, require less-specialised extraction rules than morphemes, and their segmentation is not impacted by the corpus size. In this study, we first explore the potential of syllables for open-vocabulary language modelling in 21 languages. We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy. At comparable perplexity, we show that syllables outperform characters and other subwords. Moreover, we study the importance of syllables in neural machine translation for an unrelated, low-resource language pair (Spanish--Shipibo-Konibo). In pairwise and multilingual systems, syllables outperform unsupervised subwords, as well as morphological segmentation methods, when translating into a highly synthetic language with a transparent orthography (Shipibo-Konibo). Finally, we perform a human evaluation and discuss limitations and opportunities.
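To make the "hyphenation as a syllabification proxy" idea concrete, below is a minimal sketch of how hyphenation dictionaries can split words into syllable-like units before training. It uses the pyphen library (LibreOffice/Hunspell hyphenation patterns); the `syllabify` function, the choice of the Spanish (`es`) dictionary, and the whitespace tokenisation are illustrative assumptions, not the authors' exact pipeline.

```python
# A sketch of hyphenation-based segmentation as a syllabifier proxy.
# Assumes the pyphen package and its Spanish ('es') dictionary are installed.
import pyphen


def syllabify(sentence: str, lang: str = "es") -> list[str]:
    """Split each whitespace token into hyphenation units.

    Note: hyphenation dictionaries enforce minimum left/right hyphenation
    lengths, so syllables at word edges may stay merged (e.g. a one-letter
    initial syllable) -- which is why this is only a proxy, not a true
    rule-based syllabifier.
    """
    dic = pyphen.Pyphen(lang=lang)
    units = []
    for word in sentence.split():
        # inserted() marks hyphenation points, e.g. 'lenguaje' -> 'len-gua-je';
        # a soft hyphen is used as the marker to avoid clashing with real hyphens.
        units.extend(dic.inserted(word, hyphen="\u00ad").split("\u00ad"))
    return units


if __name__ == "__main__":
    print(syllabify("traducción automática de bajos recursos"))
```

The resulting unit sequence can then be fed to an open-vocabulary language model or NMT system in place of character or BPE tokens, with a word-boundary marker re-inserted at detokenisation time.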