An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Paper title

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Authors

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

Abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
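To make "sequences of image patches" concrete, the following is a minimal sketch of how an image might be cut into non-overlapping 16x16 patches and flattened into a token sequence. The function name `patchify`, the nested-list image representation, and the 224 x 224 RGB input size are assumptions chosen for illustration; they are not taken from the abstract, and this is not the authors' implementation.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns a list of (H // patch_size) * (W // patch_size) patches in
    row-major order, each a flat list of patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows first, then pixels, then channels.
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches


# Hypothetical 224 x 224 RGB input, filled with zeros for illustration only.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
print(len(tokens), len(tokens[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of 196 "words", each a vector of 768 raw pixel values. A real model would still need to map each flattened patch to the transformer's input dimension and encode patch positions; those steps are omitted here.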
