Paper Title

UniNet: Unified Architecture Search with Convolution, Transformer, and MLP

Paper Authors

Jihao Liu, Xin Huang, Guanglu Song, Hongsheng Li, Yu Liu

Paper Abstract

Recently, transformer and multi-layer perceptron (MLP) architectures have achieved impressive results on various vision tasks. However, how to effectively combine these operators to form high-performance hybrid visual architectures remains a challenge. In this work, we study the learnable combination of convolution, transformer, and MLP by proposing a novel unified architecture search approach. Our approach contains two key designs to achieve the search for high-performance networks. First, we model the very different searchable operators in a unified form, enabling the operators to be characterized by the same set of configuration parameters. In this way, the overall search space size is significantly reduced, and the total search cost becomes affordable. Second, we propose context-aware downsampling modules (DSMs) to mitigate the gap between the different types of operators. The proposed DSMs are able to better adapt features from different types of operators, which is important for identifying high-performance hybrid architectures. Finally, we integrate the configurable operators and DSMs into a unified search space and search with a reinforcement-learning-based algorithm to fully explore the optimal combination of the operators. We then search a baseline network and scale it up to obtain a family of models, named UniNets, which achieve much better accuracy and efficiency than previous ConvNets and Transformers. In particular, our UniNet-B5 achieves 84.9% top-1 accuracy on ImageNet, outperforming EfficientNet-B7 and BoTNet-T7 with 44% and 55% fewer FLOPs, respectively. By pretraining on ImageNet-21K, our UniNet-B6 achieves 87.4%, outperforming Swin-L with 51% fewer FLOPs and 41% fewer parameters. Code is available at https://github.com/Sense-X/UniNet.
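The abstract's first key design, characterizing convolution, transformer, and MLP blocks with one shared set of configuration parameters, can be pictured with a minimal PyTorch sketch. This is an illustrative assumption of what such a unified parameterization might look like, not the authors' implementation: the make_block helper, its (op, channels, expansion) fields, and the sample configuration are all hypothetical.

```python
# Minimal sketch (not the paper's code): one configuration tuple
# (op, channels, expansion) describes a conv, transformer, or MLP block,
# so a single search space can mix all three operator types.
import torch
import torch.nn as nn


def make_block(op: str, channels: int, expansion: int = 4) -> nn.Module:
    """Build a block from a unified (op, channels, expansion) configuration."""
    hidden = channels * expansion
    if op == "conv":
        # Inverted-bottleneck convolution (MobileNet-style).
        return nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1),
        )
    if op == "transformer":
        # Self-attention over flattened spatial tokens.
        class Attention(nn.Module):
            def __init__(self):
                super().__init__()
                self.attn = nn.MultiheadAttention(
                    channels, num_heads=4, batch_first=True
                )

            def forward(self, x):  # x: (B, C, H, W)
                b, c, h, w = x.shape
                tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
                out, _ = self.attn(tokens, tokens, tokens)
                return out.transpose(1, 2).reshape(b, c, h, w)

        return Attention()
    if op == "mlp":
        # Channel MLP applied per spatial location via 1x1 convs.
        return nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1),
        )
    raise ValueError(f"unknown op: {op}")


# A hybrid candidate sampled from such a search space is just a short
# list of configuration tuples, one per block.
configs = [("conv", 64, 4), ("transformer", 64, 2), ("mlp", 64, 4)]
stage = nn.Sequential(*(make_block(op, c, e) for op, c, e in configs))
x = torch.randn(1, 64, 14, 14)
print(stage(x).shape)  # torch.Size([1, 64, 14, 14])
```

Because every candidate reduces to a list of identically-shaped tuples, an RL controller only needs to emit one small token sequence per architecture, which is how the shared parameterization keeps the search space and search cost manageable.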
