Paper Title
A Length-Extrapolatable Transformer
Paper Authors
Paper Abstract
Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating on longer sequences. We define attention resolution as an indicator of extrapolation. We then propose two designs to improve this metric in Transformers. Specifically, we introduce a relative position embedding that explicitly maximizes attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants with language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
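The blockwise causal attention mentioned in the abstract can be pictured as a masking pattern over query and key positions. Below is a minimal sketch in PyTorch, not the authors' released implementation: it assumes each query token attends causally within its own block and fully to the preceding block, with the block size typically set to the training length. The function name blockwise_causal_mask and this exact masking convention are illustrative assumptions.

```python
import torch

def blockwise_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    Each query position attends causally to keys in its own block and
    to all keys in the immediately preceding block. A sketch of the
    blockwise causal attention idea; details may differ from the paper.
    """
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, L)
    causal = k <= q                          # standard causal constraint
    # Keys must lie in the same block as the query or in the previous one.
    near_block = (q // block_size - k // block_size) <= 1
    return causal & near_block

# Example: with seq_len=8 and block_size=4, tokens 4-7 attend fully to
# tokens 0-3 (the previous block) and causally among themselves.
mask = blockwise_causal_mask(seq_len=8, block_size=4)
print(mask.int())
```

Under these assumptions, the mask keeps every attention window at most two blocks wide at inference time, so sequences longer than the training length never produce relative distances the model has not seen during training.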