Paper Title
Efficient and Eventually Consistent Collective Operations
Paper Authors
Paper Abstract
Collective operations are common features of parallel programming models that are frequently used in High-Performance Computing (HPC) and machine/deep learning (ML/DL) applications. In strong scaling scenarios, collective operations can negatively impact overall application performance: as the core count increases, the load per rank decreases, while the time spent in collective operations grows logarithmically. In this article, we propose a design for eventually consistent collectives suitable for ML/DL computations, reducing communication in Broadcast and Reduce and exploring the Stale Synchronous Parallel (SSP) synchronization model for the Allreduce collective. Moreover, we enrich the GASPI ecosystem with frequently used classic/consistent collective operations -- such as Allreduce for large messages and Alltoall as used in an HPC code. Our implementations show promising preliminary results, with significant improvements over the vendor-provided MPI alternatives, especially for Allreduce and Alltoall.
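For context, the sketch below shows how the standard blocking Allreduce of the GASPI reference implementation (GPI-2) is invoked; the classic/consistent collectives described in the abstract extend this interface, since the standard call only accepts a small, implementation-defined number of elements per invocation (queryable via gaspi_allreduce_elem_max), which motivates the large-message variant. This is an illustrative usage example against the existing GPI-2 API, not the paper's implementation.

```c
/* Minimal GASPI (GPI-2) usage sketch: sum one double across all ranks
 * with the standard blocking gaspi_allreduce.
 * Compile against GPI-2 and launch with gaspi_run. */
#include <GASPI.h>
#include <stdio.h>

int main(void)
{
  gaspi_proc_init(GASPI_BLOCK);

  gaspi_rank_t rank, nprocs;
  gaspi_proc_rank(&rank);
  gaspi_proc_num(&nprocs);

  double local  = (double) rank;  /* each rank contributes its rank id */
  double global = 0.0;            /* receives the group-wide sum       */

  /* Blocking Allreduce over the default group GASPI_GROUP_ALL. */
  gaspi_allreduce(&local, &global, 1,
                  GASPI_OP_SUM, GASPI_TYPE_DOUBLE,
                  GASPI_GROUP_ALL, GASPI_BLOCK);

  printf("rank %u of %u: sum = %g\n",
         (unsigned) rank, (unsigned) nprocs, global);

  gaspi_proc_term(GASPI_BLOCK);
  return 0;
}
```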