Paper Title
Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval
Paper Authors
Paper Abstract
The inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during search, it probes the clusters nearest to an input query and evaluates only the documents within them using subsequent codecs, thus avoiding the cost of exhaustive traversal. However, clustering is always lossy, so relevant documents can be missing from the probed clusters, which degrades retrieval quality. In contrast, lexical matching, such as the overlap of salient terms, tends to be a strong feature for identifying relevant documents. In this work, we present the Hybrid Inverted Index (HI$^2$), where embedding clusters and salient terms work collaboratively to accelerate dense retrieval. To make the best of both effectiveness and efficiency, we devise a cluster selector and a term selector to construct compact inverted lists and search through them efficiently. Moreover, we leverage simple unsupervised algorithms as well as end-to-end knowledge distillation to learn these two modules, with the latter further boosting effectiveness. Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI$^2$ to achieve lossless retrieval quality with competitive efficiency across various index settings. Our code and checkpoint are publicly available at https://github.com/namespace-Pt/Adon/tree/HI2.
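The idea of combining cluster lists with term lists can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed centroids, and the toy data are all assumptions made for illustration. The learned cluster and term selectors are replaced by nearest-centroid assignment and by indexing every listed term, and the codec is replaced by an exact inner product.

```python
from collections import defaultdict


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_index(doc_embs, doc_terms, centroids):
    """Build cluster and term inverted lists (toy stand-ins for the selectors)."""
    cluster_lists = defaultdict(list)
    term_lists = defaultdict(list)
    for doc_id, emb in enumerate(doc_embs):
        # Assumption: each document goes to its single highest-scoring centroid.
        best = max(range(len(centroids)), key=lambda c: dot(emb, centroids[c]))
        cluster_lists[best].append(doc_id)
        # Assumption: every listed term is indexed; a real term selector
        # would keep only the salient ones.
        for term in doc_terms[doc_id]:
            term_lists[term].append(doc_id)
    return cluster_lists, term_lists


def search(query_emb, query_terms, doc_embs, centroids,
           cluster_lists, term_lists, nprobe=1, top_k=2):
    """Union candidates from probed clusters and matched terms, then rank them."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query_emb, centroids[c]), reverse=True)
    candidates = set()
    for c in ranked[:nprobe]:
        candidates.update(cluster_lists.get(c, []))
    for term in query_terms:
        candidates.update(term_lists.get(term, []))
    # Exact inner product stands in for the codec; ties break on document id.
    scored = sorted(candidates, key=lambda d: (-dot(query_emb, doc_embs[d]), d))
    return scored[:top_k]


# Hypothetical toy data: two centroids, four documents.
centroids = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.8], [0.1, 0.9]]
doc_terms = [{"index"}, {"cluster"}, {"index", "hybrid"}, {"term"}]
query = [0.6, 0.5]

cluster_lists, term_lists = build_index(doc_embs, doc_terms, centroids)
clusters_only = search(query, [], doc_embs, centroids, cluster_lists, term_lists)
hybrid = search(query, ["hybrid"], doc_embs, centroids, cluster_lists, term_lists)
print(clusters_only, hybrid)
```

In this toy data, document 2 has the highest inner product with the query but is assigned to the cluster that is not probed, so the cluster-only search misses it. Adding the term list for "hybrid" brings it back into the candidate set.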