Paper Title

Pretraining a Neural Network before Knowing Its Architecture

Paper Authors

Knyazev, Boris

Paper Abstract

Training large neural networks is possible by training a smaller hypernetwork that predicts parameters for the large ones. A recently released Graph HyperNetwork (GHN) trained this way on one million smaller ImageNet architectures is able to predict parameters for large unseen networks such as ResNet-50. While networks with predicted parameters lose performance on the source task, the predicted parameters have been found useful for fine-tuning on other tasks. We study if fine-tuning based on the same GHN is still useful on novel strong architectures that were published after the GHN had been trained. We found that for recent architectures such as ConvNeXt, GHN initialization becomes less useful than for ResNet-50. One potential reason is the increased distribution shift of novel architectures from those used to train the GHN. We also found that the predicted parameters lack the diversity necessary to successfully fine-tune parameters with gradient descent. We alleviate this limitation by applying simple post-processing techniques to predicted parameters before fine-tuning them on a target task and improve fine-tuning of ResNet-50 and ConvNeXt.
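The abstract describes predicting parameters for an unseen architecture with a GHN, post-processing them to increase their diversity, and then fine-tuning on a target task. Below is a minimal PyTorch sketch of that workflow. It is an assumed illustration only: the GHN prediction step is a commented-out placeholder (the author's released ppuda library provides the actual pretrained GHN-2), and Gaussian noise injection stands in for "simple post-processing techniques", which the abstract does not specify.

```python
import torch
import torchvision


def add_noise_to_predicted_params(model: torch.nn.Module, noise_std: float = 1e-3) -> torch.nn.Module:
    """Example post-processing: perturb predicted parameters with small Gaussian
    noise to increase their diversity before fine-tuning.
    (Assumed illustration; the paper's exact post-processing may differ.)"""
    with torch.no_grad():
        for p in model.parameters():
            p.add_(noise_std * torch.randn_like(p))
    return model


# Hypothetical workflow:
# 1) build a target architecture unseen by the GHN (e.g. ResNet-50 or ConvNeXt),
# 2) predict its parameters with a pretrained GHN (placeholder call below),
# 3) post-process the predicted parameters,
# 4) fine-tune on the target task with gradient descent.
model = torchvision.models.resnet50(num_classes=10)   # target task with 10 classes
# model = ghn.predict_parameters(model)               # placeholder for GHN-based prediction (e.g. ppuda's GHN-2)
model = add_noise_to_predicted_params(model, noise_std=1e-3)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# ... a standard fine-tuning loop over the target dataset would follow here ...
```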
