Paper Title
The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT
Paper Authors
Paper Abstract
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages, along with tools for creating state-of-the-art translation models from that collection. The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world's languages. Using the package, it is possible to work on realistic low-resource scenarios, avoiding the artificially reduced setups that are common when demonstrating zero-shot or few-shot learning. For the first time, this package provides a comprehensive collection of diverse data sets in hundreds of languages, with systematic language and script annotation and data splits that extend the narrow coverage of existing benchmarks. Together with the data release, we also provide a growing number of pre-trained baseline models for individual language pairs and selected language groups.
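The abstract mentions pre-trained baseline models released alongside the data. The sketch below shows how such a baseline might be loaded and run with the Hugging Face transformers library, assuming a Marian NMT checkpoint published under the Helsinki-NLP namespace; the specific model name "Helsinki-NLP/opus-mt-en-fr" is an illustrative example, not necessarily one of the Tatoeba Challenge releases.

```python
# Minimal sketch: running a pre-trained OPUS-MT-style baseline model.
# The checkpoint name below is an assumed example; substitute the
# language pair and release you actually want to use.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"  # illustrative checkpoint name
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = ["The Tatoeba Translation Challenge covers over 500 languages."]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```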