Paper Title
Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning
Paper Authors
Paper Abstract
Bulk synchronous parallel (BSP) is a celebrated synchronization model for general-purpose parallel computing that has been successfully employed for distributed training of machine learning models. A prevalent shortcoming of BSP is that it requires workers to wait for the straggler at every iteration. To ameliorate this shortcoming of classic BSP, we propose ELASTICBSP, a model that aims to relax its strict synchronization requirement. The proposed model offers more flexibility and adaptability during the training phase, without sacrificing the accuracy of the trained model. We also propose an efficient method that materializes the model, named ZIPLINE. The algorithm is tunable and can effectively balance the trade-off between quality of convergence and iteration throughput, in order to accommodate different environments or applications. A thorough experimental evaluation demonstrates that our proposed ELASTICBSP model converges faster, and to a higher accuracy, than classic BSP. It also achieves accuracy comparable to (if not higher than) that of other sensible synchronization models.
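To make the straggler cost concrete, below is a minimal, self-contained Python sketch that compares the wall-clock makespan of a per-iteration barrier (classic BSP) against a relaxed synchronization in which a worker may run at most a fixed number of iterations ahead of the slowest worker. This is an illustration only, not the paper's implementation: the fixed staleness bound, the worker count, and all timing numbers are invented for this sketch, and ZIPLINE's adaptive choice of when to place the next barrier is not modeled.

import random

def makespan(times, bound):
    # Total wall-clock time for all workers to complete all iterations.
    # A worker may start iteration i only after (a) it has finished
    # iteration i-1 itself and (b) every worker has finished iteration
    # i-1-bound. bound=0 recovers classic BSP: a global barrier per iteration.
    num_workers, num_iters = len(times), len(times[0])
    finish = [[0.0] * (num_iters + 1) for _ in range(num_workers)]
    for i in range(1, num_iters + 1):
        barrier = 0.0
        if i - 1 - bound >= 0:
            barrier = max(finish[v][i - 1 - bound] for v in range(num_workers))
        for w in range(num_workers):
            start = max(finish[w][i - 1], barrier)
            finish[w][i] = start + times[w][i - 1]
    return max(finish[w][num_iters] for w in range(num_workers))

rng = random.Random(0)
# Per-iteration compute times for 4 workers over 200 iterations;
# worker 0 stalls intermittently, acting as the straggler.
times = [[rng.uniform(1.0, 1.1) + (1.0 if w == 0 and rng.random() < 0.2 else 0.0)
          for _ in range(200)] for w in range(4)]

print("classic BSP (bound=0):", round(makespan(times, bound=0), 1))
print("relaxed     (bound=3):", round(makespan(times, bound=3), 1))

Running this prints a noticeably smaller makespan for the relaxed bound, since intermittent stalls are amortized across iterations instead of blocking every worker each time. That is the intuition behind relaxing the per-iteration barrier; the paper's ELASTICBSP goes further by deciding adaptively when to synchronize rather than relying on a fixed bound.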