Paper Title
IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
Paper Authors
Paper Abstract
We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.
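As a rough illustration of how an encoder-decoder checkpoint of this kind could be loaded and queried for a downstream generation task, the sketch below uses the Hugging Face `transformers` sequence-to-sequence classes. The checkpoint identifier and the task prefix are assumptions for illustration, not details taken from the abstract.

```python
# Minimal sketch: loading a pretrained Italian encoder-decoder model and
# generating text with it. The model name and "riassumi:" prefix are
# hypothetical placeholders, not confirmed by the paper.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "gsarti/it5-base"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Encode an Italian input with a task prefix and generate an output sequence.
text = "riassumi: Il modello IT5 è stato pre-addestrato su un ampio corpus italiano."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```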