论文标题
最小化多维FFT中的沟通
Minimizing communication in the multidimensional FFT
论文作者
论文摘要
我们为快速傅立叶变换(FFT)提出了一种平行算法。该算法将周期性到环状的一维平行算法概括为环状到环状多维并行算法,同时仅保留只需要单个全能通信步骤的属性。这是我们最多使用$ \ sqrt {n} $处理器的限制,用于在数组上的fft,总计$ n $元素,无论尺寸$ d $或数组的形状如何。我们唯一的假设是$ n $足够复合。我们的算法以相同的数据分配开始和结束。 我们介绍了多维实现FFTU,该实现利用了其本地FFT的顺序FFTW程序,并且可以处理任何维度$ d $。我们使用MPI在超级计算机Snellius的4096个核心上获得了$ D \ LEQ 5 $的实验结果,将FFTU与平行FFTW程序以及PFFT和HEFFTE进行了比较。这些结果表明,FFTU与最新技术具有竞争力,并且它允许人们使用大量的处理器,同时保持通信仅限于单个全部操作。对于$ 1024^3 $和$ 64^5 $的尺寸阵列,FFTU在4096处理器上分别实现了149和176的速度。
We present a parallel algorithm for the fast Fourier transform (FFT) in higher dimensions. This algorithm generalizes the cyclic-to-cyclic one-dimensional parallel algorithm to a cyclic-to-cyclic multidimensional parallel algorithm while retaining the property of needing only a single all-to-all communication step. This is under the constraint that we use at most $\sqrt{N}$ processors for an FFT on an array with a total of $N$ elements, irrespective of the dimension $d$ or the shape of the array. The only assumption we make is that $N$ is sufficiently composite. Our algorithm starts and ends in the same data distribution. We present our multidimensional implementation FFTU which utilizes the sequential FFTW program for its local FFTs, and which can handle any dimension $d$. We obtain experimental results for $d\leq 5$ using MPI on up to 4096 cores of the supercomputer Snellius, comparing FFTU with the parallel FFTW program and with PFFT and heFFTe. These results show that FFTU is competitive with the state of the art and that it allows one to use a larger number of processors, while keeping communication limited to a single all-to-all operation. For arrays of size $1024^3$ and $64^5$, FFTU achieves a speedup of a factor 149 and 176, respectively, on 4096 processors.