Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models

Paper title

Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models

Authors

Isabel Papadimitriou, Kezia Lopez, Dan Jurafsky

Abstract

While multilingual language models can improve NLP performance on low-resource languages by leveraging higher-resource languages, they also reduce average performance on all languages (the 'curse of multilinguality'). Here we show another problem with multilingual models: grammatical structures in higher-resource languages bleed into lower-resource languages, a phenomenon we call grammatical structure bias. We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models: testing their preference for two carefully-chosen variable grammatical structures (optional pronoun-drop in Spanish and optional Subject-Verb ordering in Greek). We find that multilingual BERT is biased toward the English-like setting (explicit pronouns and Subject-Verb-Object ordering) as compared to our monolingual control language model. With our case studies, we hope to bring to light the fine-grained ways in which multilingual models can be biased, and encourage more linguistically-aware fluency evaluation.

