最大化群集数据的多样性

论文标题

最大化群集数据的多样性

Maximizing diversity over clustered data

论文作者

Zhang, Guangyi, Gionis, Aristides

论文摘要

最大程度的多样性旨在从集合中选择各种高质量对象，这是一个基本问题，并且在Web搜索中具有广泛的应用程序。统一或分区的矩阵约束下的多样性自然描述了有用的基数或预算要求，并接受简单的近似算法。但是，当应用于聚类数据时，流行的算法（例如迭代选择对象）和执行本地搜索失去近似值保证了最大的集群内多样性，因为它们无法以全球方式优化目标。我们提出了一种贪婪地添加一对对象而不是单胎的算法，并且在群集数据上达到了一个恒定的近似因子。我们进一步将算法扩展到单调和下质量函数的情况，并在分区的矩阵约束下。我们还设计了一种使我们的算法可扩展的技术，并在获得一种修改的方式上可以在实践中提供更好的解决方案，同时保持理论上的近似保证。与合成和现实世界中的强大基线相比，我们的算法具有出色的性能。

Maximum diversity aims at selecting a diverse set of high-quality objects from a collection, which is a fundamental problem and has a wide range of applications, e.g., in Web search. Diversity under a uniform or partition matroid constraint naturally describes useful cardinality or budget requirements, and admits simple approximation algorithms. When applied to clustered data, however, popular algorithms such as picking objects iteratively and performing local search lose their approximation guarantees towards maximum intra-cluster diversity because they fail to optimize the objective in a global manner. We propose an algorithm that greedily adds a pair of objects instead of a singleton, and which attains a constant approximation factor over clustered data. We further extend the algorithm to the case of monotone and submodular quality function, and under a partition matroid constraint. We also devise a technique to make our algorithm scalable, and on the way we obtain a modification that gives better solutions in practice while maintaining the approximation guarantee in theory. Our algorithm achieves excellent performance, compared to strong baselines in a mix of synthetic and real-world datasets.

下载PDF全文

下载文献需遵守相关版权规定

论文标题