Paper Title

GPUTreeShap: Massively Parallel Exact Calculation of SHAP Scores for Tree Ensembles

Paper Authors

Rory Mitchell, Eibe Frank, Geoffrey Holmes

Abstract

SHAP (SHapley Additive exPlanation) values provide a game theoretic interpretation of the predictions of machine learning models based on Shapley values. While exact calculation of SHAP values is computationally intractable in general, a recursive polynomial-time algorithm called TreeShap is available for decision tree models. However, despite its polynomial time complexity, TreeShap can become a significant bottleneck in practical machine learning pipelines when applied to large decision tree ensembles. Unfortunately, the complicated TreeShap algorithm is difficult to map to hardware accelerators such as GPUs. In this work, we present GPUTreeShap, a reformulated TreeShap algorithm suitable for massively parallel computation on graphics processing units. Our approach first preprocesses each decision tree to isolate variable sized sub-problems from the original recursive algorithm, then solves a bin packing problem, and finally maps sub-problems to single-instruction, multiple-thread (SIMT) tasks for parallel execution with specialised hardware instructions. With a single NVIDIA Tesla V100-32 GPU, we achieve speedups of up to 19x for SHAP values, and speedups of up to 340x for SHAP interaction values, over a state-of-the-art multi-core CPU implementation executed on two 20-core Xeon E5-2698 v4 2.2 GHz CPUs. We also experiment with multi-GPU computing using eight V100 GPUs, demonstrating throughput of 1.2M rows per second -- equivalent CPU-based performance is estimated to require 6850 CPU cores.
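To make the bin packing step described above concrete, here is a minimal Python sketch. It is illustrative only: GPUTreeShap itself is implemented in CUDA C++, and the paper's actual sub-problem sizing and packing heuristic may differ. The function name `pack_into_warps` and the example sizes are hypothetical; the sketch packs variable-sized per-tree sub-problems into 32-thread GPU warps using the classic first-fit-decreasing heuristic.

```python
# Illustrative sketch of the bin packing stage, assuming each sub-problem
# needs a fixed number of threads and must fit within one 32-thread warp.
# The exact heuristic used by GPUTreeShap may differ.

WARP_SIZE = 32  # threads per SIMT execution unit (warp) on NVIDIA GPUs

def pack_into_warps(sizes):
    """Pack sub-problem sizes (each <= 32) into warps, minimising warp count."""
    bins = []  # each bin is [remaining_capacity, [packed sizes]]
    for size in sorted(sizes, reverse=True):  # "decreasing": largest first
        if size > WARP_SIZE:
            raise ValueError("a sub-problem must fit within one warp")
        for b in bins:  # "first fit": first warp with enough spare threads
            if b[0] >= size:
                b[0] -= size
                b[1].append(size)
                break
        else:  # no existing warp has room: open a new one
            bins.append([WARP_SIZE - size, [size]])
    return [b[1] for b in bins]

# Hypothetical sub-problem sizes derived from individual decision trees.
print(pack_into_warps([20, 15, 12, 8, 5, 3]))  # -> [[20, 12], [15, 8, 5, 3]]
```

In practice the algorithm is usable without writing any GPU code: GPUTreeShap has been integrated into XGBoost, for example, where calling `Booster.predict` with `pred_contribs=True` (or `pred_interactions=True`) on a GPU-enabled model computes SHAP values on the GPU.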
