Why do tree-based models still outperform deep learning on tabular data?

Paper title

Why do tree-based models still outperform deep learning on tabular data?

Authors

Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux

Abstract

While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data ($\sim$10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20 000 compute hours hyperparameter search for each learner.
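
The second challenge listed in the abstract, preserving the orientation of the data, can be made concrete with a toy example. The sketch below is not the paper's benchmark or code: it is a minimal illustration, assuming synthetic 2D data and a hypothetical helper `best_stump_accuracy` that scores the best single axis-aligned split (a depth-1 tree). It shows that such a split separates the classes perfectly when the label depends on one original feature, and no longer does once the same points are rotated by 45 degrees.

```python
import math
import random


def best_stump_accuracy(points, labels):
    """Accuracy of the best single axis-aligned threshold split (a depth-1 tree).

    Hypothetical helper written for this illustration only.
    """
    n = len(points)
    best = 0.0
    for axis in range(len(points[0])):
        for threshold in sorted({p[axis] for p in points}):
            # Predict class 1 when the feature exceeds the threshold.
            correct = sum((p[axis] > threshold) == bool(y)
                          for p, y in zip(points, labels))
            # The stump may also predict the opposite side; keep the better polarity.
            best = max(best, correct / n, 1 - correct / n)
    return best


def rotate(points, angle):
    """Rotate 2D points counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Synthetic data (an assumption of this sketch): the label depends on the first feature only.
rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x > 0 else 0 for x, _ in points]

acc_original = best_stump_accuracy(points, labels)
acc_rotated = best_stump_accuracy(rotate(points, math.pi / 4), labels)

print(f"best stump accuracy, original axes: {acc_original:.2f}")
print(f"best stump accuracy, rotated axes:  {acc_rotated:.2f}")
```

The labels are identical in both runs; only the coordinate system changes. A learner built from axis-aligned splits is therefore sensitive to how the features are oriented, which is one way to read the abstract's point about orientation.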

Paper title