Paper Title

A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference

Authors

Gianmarco Ottavi, Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Luca Benini, Davide Rossi

Abstract

Low bit-width Quantized Neural Networks (QNNs) enable deployment of complex machine learning models on constrained devices such as microcontrollers (MCUs) by reducing their memory footprint. Fine-grained asymmetric quantization (i.e., different bit-widths assigned to weights and activations on a tensor-by-tensor basis) is a particularly interesting scheme to maximize accuracy under a tight memory constraint. However, the lack of sub-byte instruction set architecture (ISA) support in SoA microprocessors makes it hard to fully exploit this extreme quantization paradigm in embedded MCUs. Support for sub-byte and asymmetric QNNs would require many precision formats and an exorbitant amount of opcode space. In this work, we attack this problem with status-based SIMD instructions: rather than encoding precision explicitly, each operand's precision is set dynamically in a core status register. We propose a novel RISC-V ISA core MPIC (Mixed Precision Inference Core) based on the open-source RI5CY core. Our approach enables full support for mixed-precision QNN inference with different combinations of operands at 16-, 8-, 4- and 2-bit precision, without adding any extra opcode or increasing the complexity of the decode stage. Our results show that MPIC improves both performance and energy efficiency by a factor of 1.1-4.9x when compared to software-based mixed-precision on RI5CY; with respect to commercially available Cortex-M4 and M7 microcontrollers, it delivers 3.6-11.7x better performance and 41-155x higher efficiency.
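The status-based idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the MPIC hardware or its actual ISA: the names StatusRegister, sdotp, pack, unpack, and the chunk parameter are hypothetical, activations are assumed unsigned and weights signed, and the way the narrower operand is consumed in sub-groups is a guess at one plausible scheme. What it shows is that a single dot-product operation can serve every precision combination when the lane widths are read from state rather than encoded in the instruction.

```python
from dataclasses import dataclass

WORD_BITS = 32  # width of one packed SIMD register in this model


@dataclass
class StatusRegister:
    """Hypothetical core status register holding per-operand precision."""
    act_bits: int = 8  # activation bit-width: 16, 8, 4 or 2
    wgt_bits: int = 8  # weight bit-width: 16, 8, 4 or 2


def pack(values, bits):
    """Pack integers into one 32-bit word, lane 0 in the least significant bits."""
    mask = (1 << bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * bits)
    return word


def unpack(word, bits, signed):
    """Split a 32-bit word into WORD_BITS // bits lanes of the given width."""
    mask = (1 << bits) - 1
    lanes = []
    for i in range(WORD_BITS // bits):
        v = (word >> (i * bits)) & mask
        if signed and v >= 1 << (bits - 1):
            v -= 1 << bits  # two's-complement sign extension
        lanes.append(v)
    return lanes


def sdotp(status, act_word, wgt_word, acc=0, chunk=0):
    """Sum-of-dot-products step whose lane widths come from `status`.

    The same function (standing in for a single opcode) handles every
    precision pair. When one operand is narrower, its word holds more
    elements than the other; `chunk` (an assumption of this sketch)
    selects which sub-group of those elements is used.
    """
    acts = unpack(act_word, status.act_bits, signed=False)
    wgts = unpack(wgt_word, status.wgt_bits, signed=True)
    lanes = min(len(acts), len(wgts))
    if len(acts) > lanes:
        acts = acts[chunk * lanes:(chunk + 1) * lanes]
    if len(wgts) > lanes:
        wgts = wgts[chunk * lanes:(chunk + 1) * lanes]
    return acc + sum(a * w for a, w in zip(acts, wgts))


# Usage: 8-bit activations with 4-bit weights, then a switch to 8-bit weights
# by changing only the status register, not the operation being called.
status = StatusRegister(act_bits=8, wgt_bits=4)
act_word = pack([1, 2, 3, 4], 8)
wgt_word = pack([1, -2, 3, -4, 5, -6, 7, -8], 4)
mixed_lo = sdotp(status, act_word, wgt_word, chunk=0)
mixed_hi = sdotp(status, act_word, wgt_word, chunk=1)

status.wgt_bits = 8
uniform = sdotp(status, act_word, pack([1, 1, 1, 1], 8))
```

In this model, adding a new precision pair changes only the values written to the status register, which mirrors the abstract's claim that no extra opcodes are needed. A real core would also have to save and restore that state across context switches, which the sketch does not model.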
