Paper Title
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Paper Authors
Paper Abstract
Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but it ignores fine-grained alignment because the dual-stream architecture aligns image and text representations only at a global level. Earlier supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that do not scale. We propose a single-stream architecture that aligns images and language at multiple levels: global, fine-grained patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled keyword prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked tokens, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn semantic concepts that are essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, and visual question answering/reasoning against larger models and models trained on more data. Code and models are available at zaidkhan.me/SIMLA.
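For readers who want a concrete picture of the two tasks, a minimal PyTorch sketch follows. It uses a toy single-stream transformer on random tensors; every module name, dimension, and threshold in it is an illustrative assumption of ours, not the authors' released code (the actual implementation is at zaidkhan.me/SIMLA).

```python
# A minimal, self-contained sketch of XMM and PSL as described in the abstract.
# All names, sizes, and thresholds are illustrative assumptions, not SIMLA's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, PATCH_DIM, N_PATCH, N_TOK = 1000, 64, 32, 16, 12
MASK_ID = 0  # assumed token id reserved for [MASK]

class SingleStreamEncoder(nn.Module):
    """One transformer over the concatenation [image patches; text tokens],
    so self-attention mixes the two modalities at every layer."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.patch_proj = nn.Linear(PATCH_DIM, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches, token_ids):
        x = torch.cat([self.patch_proj(patches), self.tok_emb(token_ids)], 1)
        h = self.blocks(x)
        return h[:, :N_PATCH], h[:, N_PATCH:]  # image states, text states

def xmm_loss(enc, text_head, patch_head, patches, token_ids, p=0.15):
    """Symmetric cross-modality reconstruction: mask one modality and
    reconstruct it while the other modality stays visible in the stream."""
    # Direction 1: mask text tokens, predict their ids from image context.
    t_mask = torch.rand(token_ids.shape) < p
    _, t_states = enc(patches, token_ids.masked_fill(t_mask, MASK_ID))
    loss_t = F.cross_entropy(text_head(t_states[t_mask]), token_ids[t_mask])
    # Direction 2: mask image patches, regress raw patch values from text context.
    p_mask = torch.rand(patches.shape[:2]) < p
    i_states, _ = enc(patches.masked_fill(p_mask.unsqueeze(-1), 0.0), token_ids)
    loss_i = F.mse_loss(patch_head(i_states[p_mask]), patches[p_mask])
    return loss_t + loss_i

def psl_targets(token_ids, attn, momentum_probs, k=3, thresh=0.5):
    """Pseudo-label selection: the k most-attended caption tokens become
    keywords, and a momentum encoder adds high-confidence words that are
    absent from the caption but (presumably) visible in the image."""
    targets = torch.zeros(token_ids.size(0), VOCAB)
    top = attn.topk(k, dim=1).indices
    targets.scatter_(1, token_ids.gather(1, top), 1.0)
    return torch.maximum(targets, (momentum_probs > thresh).float())

# Toy usage on random tensors, just to check that the shapes line up.
enc = SingleStreamEncoder()
text_head, patch_head = nn.Linear(DIM, VOCAB), nn.Linear(DIM, PATCH_DIM)
patches = torch.randn(4, N_PATCH, PATCH_DIM)
tokens = torch.randint(1, VOCAB, (4, N_TOK))
targets = psl_targets(tokens, torch.rand(4, N_TOK), torch.rand(4, VOCAB))
visual_logits = nn.Linear(DIM, VOCAB)(enc(patches, tokens)[0].mean(1))
loss = xmm_loss(enc, text_head, patch_head, patches, tokens) \
       + F.binary_cross_entropy_with_logits(visual_logits, targets)
```

Note how the single-stream design lets XMM run symmetrically through one encoder: masked text tokens attend to intact image patches and masked patches attend to intact text within the same attention stack, which is where the fine-grained patch-token alignment comes from.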