Paper title
Robust distance correlation for variable screening
Authors
Abstract
High-dimensional data are common in modern statistical applications, and variable selection methods play an indispensable role in identifying the critical features for scientific discovery. Traditional best subset selection methods are computationally intractable with a large number of features, while regularization methods such as Lasso, SCAD, and their variants perform poorly on ultrahigh-dimensional data because of low computational efficiency and algorithmic instability. Sure screening methods have become popular alternatives: they first rapidly reduce the dimension using a simple measure such as marginal correlation, and then apply a regularization method to the retained features. A number of screening methods have been developed for different models and problems; however, none of them targets heavy-tailed data, and heavy-tailedness is another important characteristic of modern big data. In this paper, we propose a robust distance correlation ("RDC") based sure screening method for ultrahigh-dimensional regression with heavy-tailed data. The proposed method shares the good properties of the original model-free distance correlation based screening, with the additional merit of robustly estimating the distance correlation when the data are heavy-tailed, which improves model selection performance in screening. We conducted extensive simulations under different heavy-tailedness scenarios to demonstrate the advantage of the proposed procedure over existing model-based and model-free screening procedures, in terms of improved feature selection and prediction performance. We also applied the method to high-dimensional heavy-tailed RNA sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort, where RDC outperformed the other methods in prioritizing the most essential and biologically meaningful genes.
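To make the screening step concrete, the following is a minimal sketch of marginal screening with the classical sample distance correlation (the V-statistic form with double-centered pairwise distances). It is not the robust RDC estimator proposed in the paper, whose construction the abstract does not give. The function names `distance_correlation` and `screen`, the restriction to scalar features, and the choice to keep a fixed number `d` of top-ranked features are all assumptions made for illustration.

```python
import math


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute distances for a scalar sample."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]   # row means (equal to column means by symmetry)
    grand = sum(row) / n            # grand mean of the distance matrix
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def _mean_product(a, b):
    """Average of the elementwise product of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Classical (non-robust) sample distance correlation of two scalar samples."""
    a, b = _centered_distances(x), _centered_distances(y)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0  # a constant sample has zero distance variance
    return math.sqrt(max(_mean_product(a, b), 0.0) / denom)


def screen(columns, y, d):
    """Rank features by distance correlation with y and keep the top d indices.

    `columns` is a list of feature columns, each a list of n values.
    Returns (kept_indices, scores). Hypothetical helper, not the paper's code.
    """
    scores = [distance_correlation(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:d], scores
```

In a full pipeline, the indices returned by `screen` would select the columns passed on to a regularization method such as Lasso. The pairwise-distance computation costs O(n^2) per feature, so this pure-Python version is only suitable for small illustrative samples.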