Paper Title

When are Lemons Purple? The Concept Association Bias of Vision-Language Models

Paper Authors

Yutaro Yamada, Yingtian Tang, Yoyo Zhang, Ilker Yildirim

Paper Abstract

Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, such performance is not realized in tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). As a potential cause of the difficulty of applying these models to VQA and similar tasks, we report an interesting phenomenon of vision-language models, which we call the Concept Association Bias (CAB). We find that models with CAB tend to treat the input as a bag of concepts and attempt to fill in missing concepts cross-modally, leading to unexpected zero-shot predictions. We demonstrate CAB by showing that CLIP's zero-shot classification performance suffers greatly when there is a strong concept association between an object (e.g. eggplant) and an attribute (e.g. the color purple). We also show that the strength of CAB predicts performance on VQA. We observe that CAB is prevalent in vision-language models trained with contrastive losses, even when autoregressive losses are jointly employed. However, a model that relies solely on autoregressive loss appears to exhibit minimal or no signs of CAB.
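
To make the probe concrete, here is a minimal sketch (not the authors' released code) of how one might test for CAB with an off-the-shelf CLIP checkpoint via Hugging Face's zero-shot pipeline. The model name, the test image `purple_lemon.png`, and the prompt wording are illustrative assumptions, not details taken from the paper.

```python
# Minimal CAB probe sketch: ask CLIP to classify the color of a lemon
# that has been recolored purple. Under CAB, the strong lemon-yellow
# concept association may override the actual pixel evidence.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("purple_lemon.png")  # hypothetical recolored test image
prompts = ["a photo of a purple lemon", "a photo of a yellow lemon"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1).squeeze(0)

for prompt, p in zip(prompts, probs):
    print(f"{p.item():.3f}  {prompt}")
# A CAB-affected model assigns high probability to "yellow lemon"
# despite the purple pixels, reflecting the learned concept association.
```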
