Paper Title

Wide Attention Is The Way Forward For Transformers?

Paper Authors

Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D. Mullins

Paper Abstract

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach, namely building wider attention Transformers. We demonstrate that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. The impact of changing the model aspect ratio on Transformers is then studied systematically. This ratio balances the number of layers and the number of attention heads per layer while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single layer wide models perform 0.3% better than their deep counterparts. We show an in-depth evaluation and demonstrate how wide models require a far smaller memory footprint and can run faster on commodity hardware; in addition, these wider models are also more interpretable. For example, a single layer Transformer on the IMDb byte level text classification has 3.1x faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. We therefore put forward wider and shallower models as a viable and desirable alternative for small models on NLP tasks, and as an important area of research for domains beyond this.
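
To make the aspect-ratio idea concrete, here is a minimal PyTorch sketch (not the authors' code; the layer and head counts are illustrative assumptions, not the paper's exact configurations) that keeps the total number of attention heads fixed while trading depth for width:

```python
# Minimal sketch of the depth-vs-width trade-off: keep the TOTAL number of
# attention heads constant and move them into fewer, wider layers.
# Sizes below are illustrative only.
import torch
import torch.nn as nn

d_model = 128  # embedding dimension (must be divisible by the head count)

# "Deep" configuration: 4 layers x 2 heads = 8 heads in total.
deep = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=2, batch_first=True),
    num_layers=4,
)

# "Wide" configuration: 1 layer x 8 heads = 8 heads in total.
wide = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=1,
)

x = torch.randn(2, 16, d_model)  # (batch, sequence length, features)
print(deep(x).shape, wide(x).shape)  # both: torch.Size([2, 16, 128])
```

Because the wide configuration has a single layer of the same width, its encoder holds roughly a quarter of the deep configuration's parameters in this sketch, which is consistent with the smaller memory footprint and faster CPU inference the abstract reports for wide models.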
