VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

Paper title

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

Authors

Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo

Abstract

Existing vision-language pre-training (VLP) methods primarily rely on paired image-text datasets, which are either annotated with extensive human labor or crawled from the internet and then processed with elaborate data cleaning techniques. To reduce the dependency on well-aligned image-text pairs, it is promising to directly leverage large-scale text-only and image-only corpora. This paper proposes a data augmentation method, namely cross-modal CutMix (CMC), for implicit cross-modal alignment learning in unpaired VLP. Specifically, CMC transforms natural sentences from the textual view into a multi-modal view, where visually-grounded words in a sentence are randomly replaced by diverse image patches with similar semantics. The proposed CMC has several appealing properties. First, it enhances data diversity while keeping the semantic meaning intact, which helps in settings where aligned data are scarce. Second, by attaching cross-modal noise to uni-modal data, it guides models to learn token-level interactions across modalities for better denoising. Furthermore, we present a new unpaired VLP method, dubbed VLMixer, that integrates CMC with contrastive learning to pull together the uni-modal and multi-modal views for better instance-level alignment among different modalities. Extensive experiments on five downstream tasks show that VLMixer could surpass previous state-of-the-art unpaired VLP methods.
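The replacement step that the abstract describes for CMC can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cross_modal_cutmix`, the `patch_gallery` dictionary, the `replace_prob` parameter, and the `(concept, patch_id)` tuples standing in for image-patch features are all assumptions made for illustration. The abstract does not specify how visually-grounded words are detected or how patches with similar semantics are retrieved, so the sketch uses a plain dictionary lookup in place of both.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple, Union

# Hypothetical stand-in for an image-patch feature: (concept label, patch id).
Patch = Tuple[str, int]
Token = Union[str, Patch]


def cross_modal_cutmix(
    tokens: Sequence[str],
    patch_gallery: Dict[str, List[Patch]],
    replace_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[Token]:
    """Return a mixed sequence in which some words are swapped for patches.

    A word counts as "visually grounded" here only if it has an entry in
    patch_gallery (an assumption; the source does not define the criterion).
    """
    if rng is None:
        rng = random.Random()
    mixed: List[Token] = []
    for tok in tokens:
        candidates = patch_gallery.get(tok.lower())
        if candidates and rng.random() < replace_prob:
            # Swap the word for a randomly chosen patch of the same concept.
            mixed.append(rng.choice(candidates))
        else:
            # Keep the original word token.
            mixed.append(tok)
    return mixed


# Toy gallery: concept word -> candidate patches (illustrative data only).
gallery = {"dog": [("dog", 0), ("dog", 1)], "frisbee": [("frisbee", 0)]}
sentence = "a dog catches a frisbee".split()
mixed = cross_modal_cutmix(sentence, gallery, replace_prob=1.0, rng=random.Random(0))
```

With `replace_prob=1.0` every word that has a gallery entry is replaced, while words without an entry stay as text, so the mixed sequence keeps the sentence length and word order. In a real model the tuples would be patch embeddings fed to the same encoder as the word embeddings, and the mixed view would be paired with the original sentence for the contrastive objective the abstract mentions.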
