Doubly Right Object Recognition: A Why Prompt for Visual Rationales

Paper Title

Doubly Right Object Recognition: A Why Prompt for Visual Rationales

Authors

Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, Carl Vondrick

Abstract

Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a "doubly right" object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels and the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a "why prompt," which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.
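
To make the difference between plain classification accuracy and a "doubly right" score concrete, here is a minimal sketch. It is not the paper's evaluation code: the `Prediction` record, its two boolean fields, and both function names are hypothetical. It assumes that each sample has already been judged for label correctness and for rationale correctness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Prediction:
    # Hypothetical record: both flags are assumed to be judged upstream.
    label_correct: bool      # predicted label matches the ground-truth label
    rationale_correct: bool  # predicted rationale was judged correct


def classification_accuracy(preds: Sequence[Prediction]) -> float:
    """Fraction of samples with the right label, ignoring the rationale."""
    if not preds:
        raise ValueError("preds must not be empty")
    return sum(p.label_correct for p in preds) / len(preds)


def doubly_right_accuracy(preds: Sequence[Prediction]) -> float:
    """Fraction of samples where the label AND the rationale are both right."""
    if not preds:
        raise ValueError("preds must not be empty")
    both_right = sum(p.label_correct and p.rationale_correct for p in preds)
    return both_right / len(preds)


# Toy data: three of four labels are right, but only two are also
# backed by a correct rationale.
toy_preds = [
    Prediction(label_correct=True, rationale_correct=True),
    Prediction(label_correct=True, rationale_correct=False),
    Prediction(label_correct=False, rationale_correct=True),
    Prediction(label_correct=True, rationale_correct=True),
]

print(classification_accuracy(toy_preds))  # 0.75
print(doubly_right_accuracy(toy_preds))    # 0.5
```

On this toy data the model scores 0.75 on labels alone but only 0.5 once the rationale must also be right, which is the kind of gap the abstract attributes to models such as CLIP.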
