Classification of datasets with imputed missing values: does imputation quality matter?

Paper title

Classification of datasets with imputed missing values: does imputation quality matter?

Authors

Shadbahr, Tolou; Roberts, Michael; Stanczuk, Jan; Gilbey, Julian; Teare, Philip; Dittmer, Sören; Thorpe, Matthew; Torne, Ramon Vinas; Sala, Evis; Lio, Pietro; Patel, Mishal; AIX-COVNET Collaboration; Rudd, James H. F.; Mirtti, Tuomas; Rannikko, Antti; Aston, John A. D.; Tang, Jing; Schönlieb, Carola-Bibiane

Abstract

Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete, imputed, samples. The focus of the machine learning researcher is then to optimise the downstream classification performance. In this study, we highlight that it is imperative to consider the quality of the imputation. We demonstrate how the commonly used measures for assessing quality are flawed and propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data. To conclude, we highlight the compromised interpretability of classifier models trained using poorly imputed data.

Paper title