【Paper Reading】Summary of the 2014 paper "DaDianNao: A Machine-Learning Supercomputer"

DaDianNao: A Machine-Learning Supercomputer [2014], read 2021-04-29

-1 Reflections

My reflection: I have already accepted that I am a garbage-producing machine… T _ T…

0 ABSTRACT

As AI algorithms find ever wider application, many neural-network accelerators have been proposed to raise the computation/area ratio, but they remain limited by memory accesses. This paper proposes a custom multi-chip architecture for machine learning; on some of the largest known networks, a 64-chip system achieves an average speedup of 450.65x over a GPU while reducing energy by 150.31x.

1 RELATED WORK

Temam [2] proposed a neural-network accelerator for multi-layer perceptrons, but not for DNNs. Esmaeilzadeh et al. [3] proposed an NPU that approximates program functions with a hardware neural network, which is not dedicated to machine learning. Chen et al. [4] proposed an accelerator for DNNs. However, these accelerators are all limited by the size of the neural network and by the storage available for intermediate values. Moreover, Chen et al. [4] confirmed that memory access is the bottleneck of neural-network accelerators.

2 STATE-OF-THE-ART MACHINE-LEARNING TECHNIQUES

2.1 Main Layer Types

Both CNNs and DNNs are built from four main layer types: convolutional layers (CONV), pooling layers (POOL), local response normalization layers (LRN), and classifier layers (CLASS), which together produce effective classification at the output. A minimal sketch of the CONV and POOL computations follows the list below.

  • CONV: A CONV layer maps input feature maps into new feature maps via sets of filters (kernels).

  • POOL: A POOL layer takes the max or the average over regions of its input.

  • LRN: An LRN layer intensifies competition among neurons, strengthening the dominant responses.

  • CLASS: A CLASS layer usually consists of multi-layer perceptrons and produces the output categories for classification.
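
To make the layer types concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper; the sizes and values are made up) of a single-channel CONV layer followed by a max POOL layer:

```python
# Minimal sketch of CONV followed by POOL on a single channel.
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution (CNN-style cross-correlation), no padding, stride 1."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, p):
    """Non-overlapping p x p max pooling."""
    oh, ow = x.shape[0] // p, x.shape[1] // p
    return x[:oh * p, :ow * p].reshape(oh, p, ow, p).max(axis=(1, 3))

x = np.random.rand(8, 8)        # an 8x8 input feature map
k = np.random.rand(3, 3)        # one 3x3 convolution kernel
y = max_pool(conv2d(x, k), 2)   # CONV (8x8 -> 6x6), then POOL (6x6 -> 3x3)
print(y.shape)                  # (3, 3)
```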

2.2 Benchmarks

As benchmarks, the paper uses 10 of the largest known layers across these types, plus a full CNN from the ImageNet 2012 competition. The configuration of each layer is detailed in the paper.

3 THE GPU OPTION

The paper evaluates the layer types above in CUDA on a GPU (NVIDIA K20M, 5 GB GDDR5, 208 GB/s memory bandwidth, 3.52 TFlops peak, 28 nm technology) and on a 256-bit SIMD CPU (Intel Xeon E5-4620 Sandy Bridge-EP, 2.2 GHz, 1 TB memory). The analysis shows that GPUs are highly efficient on LRN layers thanks to their SIMD nature. However, the drawbacks of GPUs are also obvious: high cost, a form factor poorly suited to industrial applications, and only moderate energy efficiency.

4 THE ACCELERATOR OPTION

Chen et al. [4] proposed the DianNao accelerator for faster and more energy-efficient computation of large CNNs and DNNs; it consists of buffers for input/output neurons and synapses plus an NFU. Reproducing DianNao's results shows that the main limitation of that architecture is the memory-bandwidth bottleneck in the convolutional and classifier layers, which is therefore the optimization target of this paper.

5 A MACHINE-LEARNING SUPERCOMPUTER

In this part, the paper proposes an architecture built from multiple chips that are cheaper than a typical GPU, aiming at high machine-learning performance; its on-chip storage is large enough to hold the memory footprint of DNNs or CNNs.

5.1 Overview

To meet the memory storage and bandwidth requirements, the paper makes the following design decisions:

  • Store synapses close to the neurons that use them, so that data movement costs less time and energy. The architecture has no main memory and is fully distributed.
  • The system is deliberately biased towards storage rather than computation.
  • Transfer neuron values rather than synapse values, since this requires far less bandwidth (see the back-of-the-envelope calculation after this list).
  • Split the local storage into many tiles for higher internal bandwidth.
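
To see why the third decision matters, here is a back-of-the-envelope calculation (my own illustration; the 4096x4096 layer size is hypothetical, and 16-bit values are assumed as in the DianNao line of work). A fully connected layer has Ni*No synapses but only Ni + No neuron values, so moving neurons between chips is orders of magnitude cheaper than moving synapses:

```python
# Hypothetical fully connected (CLASS) layer with Ni inputs and No outputs.
Ni, No = 4096, 4096
bytes_per_value = 2                               # assumed 16-bit fixed point
synapse_bytes = Ni * No * bytes_per_value         # weights: Ni * No values
neuron_bytes = (Ni + No) * bytes_per_value        # activations: Ni + No values
print(f"synapses: {synapse_bytes / 2**20:.1f} MiB")        # 32.0 MiB
print(f"neurons:  {neuron_bytes / 2**10:.1f} KiB")         # 16.0 KiB
print(f"ratio:    {synapse_bytes / neuron_bytes:.0f}x")    # 2048x
```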

5.2 Node

  • Synapses Close to Neurons: by locating the synapse storage next to the neurons, only neuron values need to move, giving low-overhead data transfers and high internal bandwidth.
  • High Internal Bandwidth: the paper uses a tile-based design to avoid congestion; the output neurons are spread over the different tiles (see the partitioning sketch after this list).
  • Configurability: the tiles and the NFU pipeline can be adapted to the different layer types and execution modes.
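
The following NumPy sketch (assumed sizes and names, not the actual hardware) illustrates the tile idea for a classifier layer: output neurons are partitioned across tiles, each tile keeps only its slice of the synapses in local storage, and only the input neuron values are broadcast to every tile:

```python
import numpy as np

NUM_TILES = 16
Ni, No = 512, 512                         # assumed layer dimensions
x = np.random.rand(Ni)                    # input neurons, broadcast to all tiles
W = np.random.rand(No, Ni)                # full synapse matrix (never moved whole)

tile_rows = np.array_split(np.arange(No), NUM_TILES)
outputs = np.empty(No)
for rows in tile_rows:
    W_local = W[rows]                     # synapses resident in this tile's storage
    outputs[rows] = W_local @ x           # each tile computes its slice of the outputs

assert np.allclose(outputs, W @ x)        # same result as a monolithic computation
```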

5.3 Interconnect

  • Heavy inter-node communication occurs in only a few layers, thanks to the considerable reuse of neuron values, so the paper adopts commercially available high-performance interfaces rather than a custom interconnect. The router is implemented with wormhole routing.

5.4 Programming, Code Generation and Multi-Node Mapping

  • This architecture can be viewed as a system ASIC. Initially, the input data is partitioned across the nodes and stored in the central eDRAM. The neural network is then deployed through node instructions, which drive the control of each tile.
  • The output neuron values produced at the end of a layer, which are the input neurons of the next layer, are stored back into the central eDRAM (see the control-flow sketch after this list).
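
A hypothetical control-flow sketch of this layer-by-layer execution model (the function and variable names are mine, not from the paper):

```python
# Each layer's outputs go back to the central eDRAM and feed the next layer.
def run_network(layers, input_neurons, central_edram):
    """layers: list of callables standing in for the per-layer node instructions."""
    central_edram["activations"] = input_neurons
    for layer in layers:
        x = central_edram["activations"]     # read this layer's inputs
        y = layer(x)                         # tiles execute the layer
        central_edram["activations"] = y     # write outputs back to central eDRAM
    return central_edram["activations"]
```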

6 METHODOLOGY

6.1 Measurements

  • The paper uses the ST 28nm Low Power (LP) technology (0.9V), Synopsys Design Compiler for synthesis, IC Compiler for layout, and Synopsys PrimeTime PX for power-consumption estimation.
  • VCS is used to simulate the node RTL.
  • The GPU baseline is the NVIDIA K20M.

6.2 Baseline

As the GPU baseline, the paper uses tuned open-source CUDA implementations, to make the baseline as strong as possible.

7 EXPERIMENTAL RESULTS

  • The 16 tiles occupy nearly half of the chip area; likewise, memory cells account for roughly half of the chip.

  • The power of the proposed chip is only about 5-10% of that of a state-of-the-art GPU.

  • The experiments show that the 1-node, 4-node, 16-node and 64-node configurations achieve speedups of 21.38x, 79.81x, 216.72x, and 450.65x over the GPU baseline, respectively. The strong 1-node performance comes from the large number of operators and from the bandwidth supplied by the on-chip eDRAM.

  • In addition, the 1-node, 4-node, 16-node and 64-node configurations reduce energy by 330.56x, 323.74x, 276.04x, and 150.31x relative to the GPU baseline, respectively; the energy advantage shrinks but remains substantial as the number of nodes grows.

8 REFERENCES

[1] Y. Chen et al., "DaDianNao: A Machine-Learning Supercomputer," in Proc. 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014, pp. 609-622, doi: 10.1109/MICRO.2014.58.

[2] O. Temam, "A Defect-Tolerant Accelerator for Emerging High-Performance Applications," in Proc. International Symposium on Computer Architecture (ISCA), Portland, Oregon, 2012.

[3] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural Acceleration for General-Purpose Approximate Programs," in Proc. International Symposium on Microarchitecture (MICRO), no. 3, pp. 1-6, 2012.

[4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," in Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
