Paper Title
Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation
Authors
Abstract
The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures where index and retrieval are two different and separate components, DSI uses a single transformer model to perform both indexing and retrieval. In this paper, we identify and tackle an important issue of current DSI models: the data distribution mismatch that occurs between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing, current DSI methods learn to build connections between the text of long documents and the identifier of the documents, but then retrieval of document identifiers is based on queries that are commonly much shorter than the indexed documents. This problem is further exacerbated when using DSI for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem of current DSI models, we propose a simple yet effective indexing framework for DSI, called DSI-QG. When indexing, DSI-QG represents documents with a number of potentially relevant queries generated by a query generation model and re-ranked and filtered by a cross-encoder ranker. The presence of these queries at indexing allows the DSI models to connect a document identifier to a set of queries, hence mitigating data distribution mismatches present between the indexing and the retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.
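The indexing step described in the abstract (generate candidate queries per document, re-rank and filter them, then pair the surviving queries with the document identifier) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `build_dsi_qg_index_data`, `toy_generate`, and `toy_score` are hypothetical names, and the two toy functions are simple stand-ins for a real query generation model and a real cross-encoder ranker.

```python
from typing import Callable, Dict, List, Tuple


def build_dsi_qg_index_data(
    docs: Dict[str, str],
    generate_queries: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_generated: int = 10,
    top_k: int = 3,
) -> List[Tuple[str, str]]:
    """Return (query, docid) pairs that represent each document by queries.

    generate_queries(text, n) stands in for a query generation model.
    score(query, text) stands in for a cross-encoder relevance ranker.
    Both are assumptions made for this sketch.
    """
    pairs: List[Tuple[str, str]] = []
    for docid, text in docs.items():
        candidates = generate_queries(text, num_generated)
        # Drop duplicate queries while preserving generation order.
        unique = list(dict.fromkeys(candidates))
        # Re-rank by relevance score, highest first, and keep the top_k.
        ranked = sorted(unique, key=lambda q: score(q, text), reverse=True)
        pairs.extend((query, docid) for query in ranked[:top_k])
    return pairs


def toy_generate(text: str, n: int) -> List[str]:
    """Toy generator: sliding three-word windows over the document."""
    words = text.split()
    windows = max(1, len(words) - 2)
    return [" ".join(words[i:i + 3]) for i in range(min(n, windows))]


def toy_score(query: str, text: str) -> float:
    """Toy ranker: fraction of query words that occur in the document."""
    q_words = set(query.lower().split())
    d_words = set(text.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    corpus = {
        "doc-1": "differentiable search index for passage retrieval",
        "doc-2": "query generation with a sequence to sequence model",
    }
    for query, docid in build_dsi_qg_index_data(corpus, toy_generate, toy_score):
        print(f"{query!r} -> {docid}")
```

The resulting pairs would serve as the indexing-time training data, so that the model sees short, query-like inputs at indexing as well as at retrieval. Using queries in the query language of the target task would cover the cross-lingual case in the same way.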