Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud

Paper title

Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud

Authors

Shulin Zeng, Guohao Dai, Hanbo Sun, Kai Zhong, Guangjun Ge, Kaiyuan Guo, Yu Wang, Huazhong Yang

Abstract

FPGAs have shown great potential in providing low-latency and energy-efficient solutions for deep neural network (DNN) inference applications. Currently, the majority of FPGA-based DNN accelerators in the cloud run in a time-division multiplexing way for multiple users sharing a single FPGA, and require re-compilation with $\sim$100 s overhead. Such designs lead to poor isolation and heavy performance loss for multiple users, and are far from providing efficient and flexible FPGA virtualization for either public or private cloud scenarios. To solve these problems, we introduce a novel virtualization framework for instruction set architecture (ISA) based DNN accelerators sharing a single FPGA. We enable isolation by introducing a two-level instruction dispatch module and a multi-core based hardware resource pool. Such designs provide isolated and runtime-programmable hardware resources, which in turn give performance isolation for multiple users. On the other hand, to overcome the heavy re-compilation overhead, we propose a tiling-based instruction frame package design and two-stage static-dynamic compilation. Only the lightweight runtime information is re-compiled, with $\sim$1 ms overhead, so performance is guaranteed for the private cloud. Our extensive experimental results show that the proposed virtualization design achieves 1.07-1.69x and 1.88-3.12x throughput improvement over previous static designs using the single-core and the multi-core architectures, respectively.
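
The two-stage static-dynamic compilation idea can be illustrated with a toy sketch. This is not the authors' implementation: the `FramePackage` type, the placeholder instruction strings, and the round-robin assignment of tiles to cores are all assumptions made here for illustration. The sketch shows only the split the abstract describes, where the expensive per-tile instruction generation happens once offline and the runtime step merely maps the already-compiled frames onto whichever cores a user has been allocated.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FramePackage:
    """Hypothetical unit: pre-compiled instructions for one tile of one layer."""
    layer: str
    tile: int
    instructions: Tuple[str, ...]


def static_compile(layers: Dict[str, int]) -> List[FramePackage]:
    """Offline stage (slow, run once): emit one frame package per tile.

    `layers` maps a layer name to its tile count. The instruction strings
    are placeholders, not a real accelerator ISA.
    """
    return [
        FramePackage(name, t, (f"LOAD {name}[{t}]", f"CALC {name}[{t}]", f"SAVE {name}[{t}]"))
        for name, tiles in layers.items()
        for t in range(tiles)
    ]


def dynamic_compile(frames: List[FramePackage], cores: List[int]) -> Dict[int, List[FramePackage]]:
    """Online stage (fast): assign already-compiled frames to allocated cores.

    Round-robin by tile index is an assumption of this sketch, not the
    paper's scheduling policy.
    """
    if not cores:
        raise ValueError("at least one core must be allocated")
    schedule: Dict[int, List[FramePackage]] = {core: [] for core in cores}
    for frame in frames:
        schedule[cores[frame.tile % len(cores)]].append(frame)
    return schedule


# Compile once, then re-target the same frames to different core allocations.
frames = static_compile({"conv1": 4, "conv2": 2})
two_core = dynamic_compile(frames, cores=[0, 1])
three_core = dynamic_compile(frames, cores=[2, 5, 7])
```

When a user's core allocation changes, only `dynamic_compile` is re-run in this sketch; the frame packages themselves are reused unchanged, which is the property the abstract credits for the much smaller re-compilation overhead.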
