Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data

Paper Title

Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data

Authors

Qi Dai, Jian-wei Liu, Yang Liu

Abstract

Imbalanced classification is one of the important and challenging problems in data mining and machine learning. The performance of traditional classifiers is severely affected by data problems such as class imbalance, class overlap, and noise. When it was first proposed, the Tomek-Link algorithm was used only to clean data. In recent years, there have been reports of combining the Tomek-Link algorithm with sampling techniques. Tomek-Link sampling can effectively reduce class overlap in the data, remove majority instances that are difficult to distinguish, and improve classification accuracy. However, the Tomek-Link under-sampling algorithm considers only boundary instances that are each other's nearest neighbors globally, and it ignores potential local overlapping instances. When the number of minority instances is small, the under-sampling effect is unsatisfactory and the performance improvement of the classification model is not obvious. Therefore, building on Tomek-Link, a multi-granularity relabeled under-sampling algorithm (MGRU) is proposed. The algorithm fully considers the local information of the data set in local granularity subspaces and detects potential local overlapping instances. The overlapping majority instances are then eliminated according to the global relabeled index value, which effectively expands the detection range of Tomek-Link. Simulation results show that, when the optimal global relabeled index value is selected for under-sampling, the classification accuracy and generalization performance of the proposed under-sampling algorithm are significantly better than those of the baseline algorithms.
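
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of classical Tomek-Link under-sampling, in which two instances with different labels that are each other's nearest neighbors form a link and the majority member of each link is removed. It is not the proposed MGRU algorithm, whose local granularity subspaces and global relabeled index are not specified in the abstract. The function name `tomek_link_undersample`, the Euclidean distance, and the toy data are assumptions made for illustration only.

```python
from math import dist


def tomek_link_undersample(X, y, majority_label):
    """Drop majority instances that take part in a Tomek link.

    Two instances form a Tomek link when they carry different labels and
    each one is the other's nearest neighbor (Euclidean distance assumed).
    """
    n = len(X)
    # Index of the nearest neighbor of each instance, excluding itself.
    nn = [
        min((j for j in range(n) if j != i), key=lambda j: dist(X[i], X[j]))
        for i in range(n)
    ]
    keep = []
    for i in range(n):
        j = nn[i]
        in_link = nn[j] == i and y[i] != y[j]
        # Only the majority member of a link is removed.
        if not (in_link and y[i] == majority_label):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy 1-D data: the majority point at 1.0 and the minority point at 1.1
# are mutual nearest neighbors with different labels, so they form a link.
X = [[0.0], [0.2], [1.0], [1.1], [3.0]]
y = [0, 0, 0, 1, 1]
X_res, y_res = tomek_link_undersample(X, y, majority_label=0)
# y_res == [0, 0, 1, 1]; the majority point at 1.0 has been removed.
```

Because only mutual nearest neighbors are examined, a majority instance that overlaps the minority region without being in such a pair is left untouched, which is the limitation the abstract attributes to Tomek-Link under-sampling.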
