Paper Title


Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Paper Authors

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt

Paper Abstract


The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieves a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.
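The core recipe in the abstract is simple enough to sketch in code: element-wise average the weights of several checkpoints fine-tuned from the same pre-trained model, either uniformly or greedily (adding a checkpoint only when held-out accuracy improves). The following is a minimal PyTorch sketch under those assumptions; the checkpoint-loading details and the `evaluate_fn` callback are illustrative placeholders, not the authors' released implementation at the repository linked above.

```python
# Minimal sketch of uniform and greedy "model soups", assuming PyTorch
# state_dicts from fine-tuning runs that share a single architecture.
# `evaluate_fn` is a hypothetical callback that scores a candidate soup
# on held-out validation data and returns a float accuracy.
import torch


def average_state_dicts(state_dicts):
    """Element-wise average of the parameters in several checkpoints."""
    return {
        key: torch.mean(
            torch.stack([sd[key].float() for sd in state_dicts]), dim=0
        )
        for key in state_dicts[0]
    }


def uniform_soup(state_dicts):
    """Average every fine-tuned checkpoint with equal weight."""
    return average_state_dicts(state_dicts)


def greedy_soup(state_dicts, evaluate_fn):
    """Add checkpoints one at a time, keeping each only if accuracy does not drop.

    `state_dicts` should be sorted by individual validation accuracy
    (best first); `evaluate_fn(state_dict) -> float` scores a candidate soup.
    """
    ingredients = [state_dicts[0]]
    best_score = evaluate_fn(average_state_dicts(ingredients))
    for candidate in state_dicts[1:]:
        score = evaluate_fn(average_state_dicts(ingredients + [candidate]))
        if score >= best_score:
            ingredients.append(candidate)
            best_score = score
    return average_state_dicts(ingredients)


# Hypothetical usage: load checkpoints from a hyperparameter sweep, build a
# soup, and load it into one model, so inference cost stays that of a single model.
# checkpoints = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
# model.load_state_dict(uniform_soup(checkpoints))
```

Because the soup is a single set of weights, it keeps the inference and memory cost of one model, unlike a logit ensemble, which must run every member at test time.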
