Paper Title


More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity

Paper Authors

Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Mocanu, Zhangyang Wang

Paper Abstract


Transformers have quickly shined in the computer vision world since the emergence of Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) seems to be challenged by increasingly effective transformer-based models. Very recently, a couple of advanced convolutional models strike back with large kernels motivated by the local-window attention mechanism, showing appealing performance and efficiency. While one of them, i.e. RepLKNet, impressively manages to scale the kernel size to 31x31 with improved performance, the performance starts to saturate as the kernel size continues growing, compared to the scaling trend of advanced ViTs such as Swin Transformer. In this paper, we explore the possibility of training extreme convolutions larger than 31x31 and test whether the performance gap can be eliminated by strategically enlarging convolutions. This study ends up with a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale up kernels to 61x61 with better performance. Built on this recipe, we propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with sparse factorized 51x51 kernels that can perform on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures like ConvNeXt and RepLKNet, on ImageNet classification as well as a wide range of downstream tasks including semantic segmentation on ADE20K, object detection on PASCAL VOC 2007, and object detection/segmentation on MS COCO.
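
The abstract's recipe of sparse, factorized extreme kernels can be illustrated with a short code sketch. Below is a minimal PyTorch sketch, assuming for illustration that the 51x51 kernel is approximated by two parallel rectangular depthwise convolutions (51x5 and 5x51) plus a small square kernel whose outputs are summed; the module name, exact kernel shapes, and the omission of SLaK's dynamic-sparsity training are assumptions, not the paper's verbatim implementation.

```python
# Minimal sketch (assumed, for illustration): a very large depthwise kernel is
# replaced by two parallel rectangular depthwise convolutions plus a small
# square kernel, with their outputs summed. SLaK's dynamic sparsity is omitted.
import torch
import torch.nn as nn


class FactorizedLargeKernel(nn.Module):
    def __init__(self, channels: int, large: int = 51, short: int = 5):
        super().__init__()
        # Two rectangular depthwise convolutions spanning a large receptive field.
        self.horizontal = nn.Conv2d(
            channels, channels, kernel_size=(short, large),
            padding=(short // 2, large // 2), groups=channels,
        )
        self.vertical = nn.Conv2d(
            channels, channels, kernel_size=(large, short),
            padding=(large // 2, short // 2), groups=channels,
        )
        # A small square kernel to retain local detail.
        self.small = nn.Conv2d(
            channels, channels, kernel_size=short,
            padding=short // 2, groups=channels,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing the branches approximates a single large-kernel response.
        return self.horizontal(x) + self.vertical(x) + self.small(x)


if __name__ == "__main__":
    block = FactorizedLargeKernel(channels=64)
    out = block(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

The two rectangular branches keep the parameter and FLOP cost roughly linear in the kernel size rather than quadratic, which is what makes scaling beyond 31x31 tractable in this sketch.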
