Paper Title

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Paper Authors

Yufeng Cui, Lichen Zhao, Feng Liang, Yangguang Li, Jing Shao

Paper Abstract

Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm for learning visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging because they do not adopt consistent training recipes and even use different data, hampering fair comparison between methods. In this work, we propose CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. We conduct a comprehensive analysis of three key factors: data, supervision, and model architecture. We find several intuitive or counter-intuitive insights: (1) data quality has a significant impact on performance; (2) certain kinds of supervision affect Convolutional Networks (ConvNets) and Vision Transformers (ViTs) differently, and applying more appropriate supervision can effectively improve the performance of CLIP; (3) curtailing the text encoder reduces the training cost without much affecting the final performance. Moreover, we combine DeCLIP with FILIP, yielding our strongest variant, DeFILIP. The CLIP-benchmark will be released at https://github.com/Sense-GVT/DeCLIP for future CLIP research.
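For readers unfamiliar with the paradigm, below is a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style methods optimize over a batch of matched image-text pairs. The PyTorch formulation, the temperature value, and the embedding dimension are illustrative assumptions, not the exact training recipe benchmarked in the paper.

```python
# Minimal sketch of the CLIP-style symmetric contrastive loss (InfoNCE).
# Temperature and embedding size below are illustrative, not the paper's setup.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over N matched image-text pairs.

    image_features, text_features: (N, D) embeddings from the two encoders.
    Matched pairs sit on the diagonal of the N x N similarity matrix;
    all off-diagonal pairs serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt))
```

Variants such as DeCLIP and FILIP keep this dual-encoder structure but change the supervision around the core loss, e.g. adding extra supervision signals or finer-grained token-level alignment, which is the design space the benchmark compares.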
