Paper title
Tandem clustering with invariant coordinate selection
Paper authors
Paper abstract
For multivariate data, tandem clustering is a well-known technique aiming to improve cluster identification through initial dimension reduction. Nevertheless, the usual approach using principal component analysis (PCA) has been criticized for focusing solely on inertia so that the first components do not necessarily retain the structure of interest for clustering. To address this limitation, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while providing affine invariant components. Certain theoretical results have been previously derived and guarantee that under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. However, ICS has garnered minimal attention within the context of clustering. Two challenges associated with ICS include choosing the pair of scatter matrices and selecting the components to retain. For effective clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix capturing the within-cluster structure and another capturing the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully chosen subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of preserving the cluster structure in the data. In an extensive simulation study and empirical applications with benchmark data sets, various combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the PCA-based approach.
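The joint diagonalization mentioned in the abstract can be illustrated with a minimal sketch. It assumes the classical COV-COV4 scatter pair, which is a common ICS default but not one of the pairs the abstract singles out (local shape, pairwise scatters, or MCD with a small subset). The function names `cov4` and `ics`, the simulated two-group data, and all parameter values are hypothetical and chosen only for illustration. Given two scatter matrices S1 and S2, the sketch finds a matrix B with B S1 Bᵀ = I and B S2 Bᵀ = diag(rho), and projects the centered data onto the rows of B.

```python
import numpy as np

def cov4(X):
    """Fourth-moment scatter matrix (COV4): covariance weighted by
    squared Mahalanobis distances, scaled by 1 / (p + 2)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)  # squared Mahalanobis distances
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))

def ics(X, S1, S2):
    """Jointly diagonalize S1 and S2 so that B S1 B^T = I and
    B S2 B^T = diag(rho), with rho sorted in decreasing order."""
    # Symmetric inverse square root of S1 (whitening step)
    w, U = np.linalg.eigh(S1)
    S1_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
    # Eigendecomposition of S2 expressed in the whitened coordinates
    M = S1_inv_sqrt @ S2 @ S1_inv_sqrt
    rho, V = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(rho)[::-1]
    rho, V = rho[order], V[:, order]
    B = V.T @ S1_inv_sqrt
    Z = (X - X.mean(axis=0)) @ B.T  # invariant coordinates
    return rho, B, Z

# Illustrative data (assumption): two balanced groups separated on variable 0,
# plus three noise variables with much larger variance.
rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 4))
X[:, 0] += np.where(labels == 1, 3.0, -3.0)
X[:, 1:] *= 5.0

S1 = np.cov(X, rowvar=False)
S2 = cov4(X)
rho, B, Z = ics(X, S1, S2)
print(np.round(rho, 3))  # generalized eigenvalues, largest first
```

In a tandem approach, a clustering algorithm such as k-means would then be run on a selected subset of the columns of `Z` (the first and/or last components) rather than on `X` itself. Which components to keep, and which scatter pair to use, are exactly the two choices the abstract identifies as the main challenges.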