论文标题
聚类通过多面体描述解释
Cluster Explanation via Polyhedral Descriptions
论文作者
论文摘要
聚类是一个无监督的学习问题,旨在将未标记的数据点划分为具有相似特征的组。传统的聚类算法对他们发现的群体提供了有限的见解,因为他们的主要重点是准确性,而不是小组分配的解释性。这刺激了有关可解释的机器学习的最新工作,用于聚类。在本文中,我们关注群集描述问题,鉴于数据集及其分配到簇中,任务是解释簇。我们引入了一种新方法来解释簇,通过在每个集群周围构造多面体,同时最大程度地减少所得Polyhedra的复杂性或描述中使用的特征数量。我们将群集描述问题作为整数程序提出,并提出了一种列生成方法,以搜索可用于构建Polyhedra的候选半空间。为了处理大型数据集,我们介绍了一种新颖的分组方案,该方案首先形成较小的数据点组,然后围绕分组数据构建多面体,该策略超出了一个简单的次采样数据。与最先进的群集描述算法相比,我们的方法能够通过提高描述准确性来实现竞争性解释性。
Clustering is an unsupervised learning problem that aims to partition unlabelled data points into groups with similar features. Traditional clustering algorithms provide limited insight into the groups they find as their main focus is accuracy and not the interpretability of the group assignments. This has spurred a recent line of work on explainable machine learning for clustering. In this paper we focus on the cluster description problem where, given a dataset and its partition into clusters, the task is to explain the clusters. We introduce a new approach to explain clusters by constructing polyhedra around each cluster while minimizing either the complexity of the resulting polyhedra or the number of features used in the description. We formulate the cluster description problem as an integer program and present a column generation approach to search over an exponential number of candidate half-spaces that can be used to build the polyhedra. To deal with large datasets, we introduce a novel grouping scheme that first forms smaller groups of data points and then builds the polyhedra around the grouped data, a strategy which out-performs simply sub-sampling data. Compared to state of the art cluster description algorithms, our approach is able to achieve competitive interpretability with improved description accuracy.