Paper Title
ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training
Paper Authors
Paper Abstract
CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique image-text pair in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.
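Below is a minimal NumPy sketch of the retrieval-style common space the abstract describes: an image and a text are each represented by their vector of similarities to the N anchor image-text pairs, so both modalities land in the same N-dimensional space with no training. The function names, the random stand-in embeddings, and the top-k sparsification value `k` and exponent `p` are illustrative assumptions, not the paper's exact recipe or settings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def relative_representation(z, anchors, k=100, p=8):
    """Map an embedding z of shape (D,) to an N-dim vector of similarities
    to the N anchor embeddings of shape (N, D). Keep only the top-k entries
    (sparsification) and raise them to the power p; k and p are illustrative."""
    sims = anchors @ z                       # cosine similarities (inputs assumed unit-norm)
    if k < sims.shape[0]:
        drop = np.argpartition(sims, -k)[:-k]
        sims = sims.copy()
        sims[drop] = 0.0                     # zero out everything except the k largest entries
    sims = np.sign(sims) * np.abs(sims) ** p
    return l2_normalize(sims)

# --- toy usage with random stand-ins for pretrained unimodal embeddings ---
rng = np.random.default_rng(0)
N, D_img, D_txt = 1000, 512, 384             # N coupled image-text pairs, two unrelated embedding sizes

anchor_img = l2_normalize(rng.normal(size=(N, D_img)))   # image-encoder outputs for the N anchor pairs
anchor_txt = l2_normalize(rng.normal(size=(N, D_txt)))   # text-encoder outputs for the same N pairs

query_img = l2_normalize(rng.normal(size=D_img))          # embedding of the image to classify
class_txts = l2_normalize(rng.normal(size=(10, D_txt)))   # embeddings of 10 candidate class prompts

# Both modalities are compared in the same N-dimensional "relative" space.
q_rel = relative_representation(query_img, anchor_img)
c_rel = np.stack([relative_representation(t, anchor_txt) for t in class_txts])

scores = c_rel @ q_rel                        # zero-shot scores: similarity in the common space
print("predicted class:", int(np.argmax(scores)))
```

Swapping in a new multimodal collection only requires recomputing the anchor embeddings, which is why, as the abstract notes, deploying an updated version takes seconds rather than a training run, and why each dimension of the representation is directly interpretable as the similarity to one specific image-text pair.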