Paper Title
Beyond English-Centric Multilingual Machine Translation
Paper Authors
Paper Abstract
Existing work in translation has demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric, trained only on data that was translated from or to English. While such data is available in large quantities, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open-source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. We then explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.
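To illustrate the capacity-scaling idea mentioned in the abstract, the sketch below shows one way language-specific sparse parameters can sit alongside shared dense layers: each language group owns its own feed-forward weights, and an example is routed to exactly one of them, so only a sparse subset of the added parameters is active per example. This is a minimal PyTorch sketch under our own assumptions (the class name, grouping scheme, and routing are hypothetical illustrations, not the paper's exact architecture).

```python
import torch
import torch.nn as nn


class LanguageSpecificFFN(nn.Module):
    """Hypothetical sketch of a language-specific sparse layer.

    Each language group gets its own feed-forward expert; an input is
    routed to the single expert matching its language group, so total
    capacity grows with the number of groups while per-example compute
    stays constant.
    """

    def __init__(self, d_model: int, d_ff: int, num_groups: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_groups)
        )

    def forward(self, x: torch.Tensor, group_id: int) -> torch.Tensor:
        # Dense layers elsewhere in the model are shared across all
        # languages; here only this example's group-specific weights run.
        return self.experts[group_id](x)


# Example: route a batch of hidden states through group 2's expert.
layer = LanguageSpecificFFN(d_model=512, d_ff=2048, num_groups=4)
out = layer(torch.randn(8, 10, 512), group_id=2)
```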
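Since the final M2M-100 checkpoints were open-sourced, direct non-English translation can also be tried through the Hugging Face `transformers` port of the model. The snippet below is a usage sketch for Chinese-to-French translation with the 418M-parameter checkpoint; the model identifier and API follow that port, not the paper's original release scripts.

```python
# Direct zh -> fr translation with a released M2M-100 checkpoint,
# via the Hugging Face `transformers` port (facebook/m2m100_418M).
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "zh"  # tokenize the input as Chinese
encoded = tokenizer("生活就像一盒巧克力。", return_tensors="pt")

# Forcing the first decoder token to the French language code makes the
# model translate zh -> fr directly, with no English pivot.
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("fr")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```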