Paper Title
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
Paper Authors
Paper Abstract
While language models store a massive amount of world knowledge implicitly in their parameters, even very large models often fail to encode information about rare entities and events, while incurring huge computational costs. Recently, retrieval-augmented models, such as REALM, RAG, and RETRO, have incorporated world knowledge into language generation by leveraging an external non-parametric index and have demonstrated impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the ubiquitous amount of knowledge in other modalities like images -- much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language generation. MuRAG is pre-trained with a mixture of large-scale image-text and text-only corpora using a joint contrastive and generative loss. We perform experiments on two different datasets that require retrieving and reasoning over both images and text to answer a given query: WebQA and MultimodalQA. Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets and under both distractor and full-wiki settings.
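The abstract describes a retrieve-then-generate pipeline: a query is embedded, the closest items in a non-parametric multimodal memory are retrieved by inner-product search, and the retrieved items are prepended to the query before generation. Below is a minimal sketch of that control flow only; the helpers `encode_query` and `build_memory` and the toy memory contents are hypothetical stand-ins, not the paper's actual encoders, data, or training objective.

```python
# Minimal sketch of a retrieve-then-generate flow over a multimodal memory.
# Encoders are hypothetical stand-ins (random projections), NOT MuRAG's backbone;
# only the overall retrieval-augmentation control flow follows the abstract.
import numpy as np

def encode_query(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical dense encoder: maps a string to a unit-norm vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def build_memory(items: list, dim: int = 64) -> np.ndarray:
    """Hypothetical multimodal memory: one dense vector per image/text item."""
    return np.stack([encode_query(x, dim) for x in items])

def retrieve_top_k(query_vec: np.ndarray, memory: np.ndarray, k: int = 2) -> np.ndarray:
    """Maximum inner product search over the non-parametric memory."""
    scores = memory @ query_vec
    return np.argsort(-scores)[:k]

# Toy memory mixing image captions and text snippets (illustrative only).
memory_items = [
    "[image] a red double-decker bus in London",
    "[text] The Eiffel Tower is 330 metres tall.",
    "[image] the Golden Gate Bridge at sunset",
]
memory = build_memory(memory_items)

question = "How tall is the Eiffel Tower?"
top = retrieve_top_k(encode_query(question), memory, k=2)

# The retrieved items would be prepended to the question and passed to the generator,
# which in the paper is trained jointly with a contrastive retrieval loss.
augmented_input = " ".join(memory_items[i] for i in top) + " " + question
print(augmented_input)
```

In the actual system, the memory stores image and text embeddings from a learned backbone, and the contrastive term trains the retriever while the generative term trains answer generation; the sketch above only mirrors the inference-time retrieval step under those assumptions.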