Title
Fitting Semiparametric Cumulative Probability Models for Big Data
Authors
Abstract
Cumulative probability models (CPMs) are a robust alternative to linear models for continuous outcomes. However, they are not feasible for very large datasets due to elevated running time and memory usage, which depend on the sample size, the number of predictors, and the number of distinct outcomes. We describe three approaches to address this problem. In the divide-and-combine approach, we divide the data into subsets, fit a CPM to each subset, and then aggregate the information. In the binning and rounding approaches, the outcome variable is redefined to have a greatly reduced number of distinct values. We consider rounding to a decimal place and rounding to significant digits, both with a refinement step to help achieve the desired number of distinct outcomes. We show with simulations that these approaches perform well and their parameter estimates are consistent. We investigate how running time and peak memory usage are influenced by the sample size, the number of distinct outcomes, and the number of predictors. As an illustration, we apply the approaches to a large publicly available dataset investigating matrix multiplication runtime with nearly one million observations.
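To make the rounding idea concrete, below is a minimal Python sketch of rounding a continuous outcome to significant digits with a simple refinement step that increases the number of digits until a target number of distinct values is reached. The function names, the stopping rule, and the use of NumPy are illustrative assumptions; the paper's own implementation and its exact refinement procedure may differ.

```python
import numpy as np

def round_to_sig_digits(y, digits):
    """Round each element of y to the given number of significant digits."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    nonzero = y != 0
    exponent = np.floor(np.log10(np.abs(y[nonzero])))
    factor = 10.0 ** (digits - 1 - exponent)
    out[nonzero] = np.round(y[nonzero] * factor) / factor
    return out

def round_with_refinement(y, target_distinct, max_digits=10):
    """Illustrative refinement: add significant digits until the rounded
    outcome has at least `target_distinct` distinct values.
    (Not the authors' exact procedure.)"""
    for digits in range(1, max_digits + 1):
        y_rounded = round_to_sig_digits(y, digits)
        if len(np.unique(y_rounded)) >= target_distinct:
            break
    return y_rounded, digits

# Example usage on a synthetic skewed outcome with ~1e6 observations:
rng = np.random.default_rng(0)
y = rng.lognormal(size=1_000_000)
y_coarse, d = round_with_refinement(y, target_distinct=1000)
print(d, len(np.unique(y_coarse)))  # far fewer distinct outcomes to fit the CPM on
```

The coarsened outcome `y_coarse` would then replace the original outcome when fitting the CPM, shrinking the number of intercept parameters and hence the running time and memory footprint.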