Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods

Paper title

Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods

Authors

Du, Jiacong, Boss, Jonathan, Han, Peisong, Beesley, Lauren J, Goutman, Stephen A, Batterman, Stuart, Feldman, Eva L, Mukherjee, Bhramar

Abstract

Penalized regression methods, such as lasso and elastic net, are used in many biomedical applications when simultaneous regression coefficient estimation and variable selection is desired. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors, making it difficult to ascertain a final active set without resorting to ad hoc combination rules. In this paper we consider a general class of penalized objective functions which, by construction, force selection of the same variables across multiply-imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as "stacked" and "grouped" objective functions. Building on existing work, we (a) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for both continuous and binary outcome data, (b) incorporate adaptive shrinkage penalties, (c) compare these methods through simulation, and (d) develop an R package miselect for easy implementation. Simulations demonstrate that the "stacked" objective function approaches tend to be more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Repository (UMAPR) which aims to identify the association between persistent organic pollutants and ALS risk.
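
To make the "stacked" idea concrete, the following is a minimal sketch, not the miselect implementation. It assumes a continuous outcome, equal weight for every row, no intercept, and a plain (non-adaptive) lasso penalty. The function names `soft_threshold` and `stacked_lasso` are hypothetical. All imputed datasets are stacked row-wise and a single coefficient vector is fitted by cyclic coordinate descent, so a variable is either selected in every imputed dataset or in none.

```python
import numpy as np


def soft_threshold(z, gamma):
    """Soft-thresholding operator used in lasso coordinate updates."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def stacked_lasso(X_list, y_list, lam, n_iter=1000, tol=1e-8):
    """Illustrative stacked lasso (assumption: equal row weights, no intercept).

    Minimizes (1 / (2N)) * sum_d ||y_d - X_d b||^2 + lam * ||b||_1,
    where N is the total number of stacked rows and b is shared
    across all imputed datasets.
    """
    X = np.vstack(X_list)          # stack the imputed design matrices
    y = np.concatenate(y_list)     # stack the matching outcomes
    n_rows, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy() # residual for beta = 0
    col_ss = (X ** 2).sum(axis=0) / n_rows

    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of column j with the partial residual
            rho = X[:, j] @ resid / n_rows + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_beta = np.array([1.5, 0.0, 0.0, -2.0])
    # Stand-in for D = 5 imputed datasets: random draws, NOT a real imputation
    X_list = [rng.normal(size=(100, 4)) for _ in range(5)]
    y_list = [X @ true_beta + rng.normal(size=100) for X in X_list]
    beta_hat = stacked_lasso(X_list, y_list, lam=0.2)
    print("selected variables:", np.flatnonzero(beta_hat))
```

Because the penalty acts on one shared coefficient vector, no post hoc rule is needed to reconcile selections across imputations. The "grouped" formulation instead keeps one coefficient vector per imputed dataset and ties them together with a group penalty, which this sketch does not cover.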
