Title
TinyViT: Fast Pretraining Distillation for Small Vision Transformers
Authors
Abstract
Vision Transformer (ViT) has recently drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient small vision transformers, pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling the small models to reap the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored on disk in advance to save memory cost and computation overhead. The tiny student transformers are automatically scaled down from a large pretrained model under computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, by increasing the image resolution, TinyViT can reach 86.5% accuracy, slightly better than Swin-L while using only 11% of the parameters. Last but not least, we demonstrate the good transferability of TinyViT on various downstream tasks. Code and models are available at https://github.com/microsoft/Cream/tree/main/TinyViT.
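The abstract does not spell out how the teacher logits are sparsified before being cached on disk. A minimal sketch, assuming a top-k scheme (store only the k largest logits per image and reconstruct an approximately dense vector at distillation time); the function names and the fill value are illustrative, not from the paper:

```python
import numpy as np

def sparsify_logits(logits, k=10):
    """Keep only the top-k teacher logits per sample (class indices +
    values), discarding the rest to shrink on-disk storage."""
    idx = np.argsort(logits, axis=1)[:, -k:]        # top-k class indices
    vals = np.take_along_axis(logits, idx, axis=1)  # corresponding logits
    return idx.astype(np.int32), vals.astype(np.float16)

def densify(idx, vals, num_classes, fill=-1e4):
    """Rebuild a dense logit vector for distillation; classes that were
    not stored get a large negative logit, so their softmax mass is ~0."""
    dense = np.full((idx.shape[0], num_classes), fill, dtype=np.float32)
    np.put_along_axis(dense, idx, vals.astype(np.float32), axis=1)
    return dense

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Demo: 4 images, a 1000-class teacher (ImageNet-1k sized output).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 1000))
idx, vals = sparsify_logits(teacher, k=10)     # ~2% of the original size
recovered = densify(idx, vals, num_classes=1000)
p_full = softmax(teacher)
p_sparse = softmax(recovered)                  # soft labels for the student
```

Storing 10 index/value pairs instead of 1000 float logits per image is what makes precomputing soft labels for a large-scale pretraining set tractable; the student then distills from `p_sparse` instead of querying the teacher online.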