Open Source IP Manual
Report of Open-Source IPs

PI: Deming Chen (UIUC)
In collaboration with Zhiru Zhang (Cornell)
February 21, 2019

1 Introduction

This document describes an open-source IP repository designed specifically for machine learning applications, such as convolution and pooling IPs for convolutional neural networks (CNNs), long-term recurrent convolutional networks (LRCNs), etc. Each IP is provided with an introduction, an interface description, a description of its inputs and outputs, its parameter configuration, resource and performance data, and a GitHub link for downloading the source code. The IPs include:

1. Standard convolution IPs
2. Depth-wise separable convolution IPs
3. Pooling IPs
4. Bounding box regression IP
5. Long-term Recurrent Convolutional Network IP

The IPs are developed in C/C++. The source code is synthesizable with Xilinx Vivado High-Level Synthesis (Vivado HLS), which can also generate the Register Transfer Level (RTL) code. The source code for the IPs listed above can be found on GitHub:

https://github.com/DNN-Accelerators/Open-Source-IPs

In addition, we present two full-fledged FPGA accelerator designs for machine learning applications: spam filtering and binarized neural networks. A brief introduction to the accelerator designs is given in Section 3. The source code of the designs is available at:

https://github.com/cornell-zhang/rosetta

2 IP Repository

2.1 Standard Convolution IPs

2.1.1 Introduction

Convolution is the most common component in a DNN model. Because convolution layers come in various sizes and types, we developed a configurable standard convolution IP template that accepts run-time arguments to complete convolution layer tasks in DNN models with flexible layer configurations. The IP can also be configured with hardware parameters to accommodate different resource and performance requirements.

2.1.2 Interface

The input/output interfaces use the memory-mapped AXI4 bus protocol.
The bus width is 128 bits. The control signals use an AXI-Lite GPIO register interface.

2.1.3 Inputs and Outputs

• Inputs: feature map data of size (Hin, Win, Cin). The input feature map size can be specified by the user according to the input image size or intermediate results. Hin and Win represent the height and width of the input data, respectively, and Cin represents the number of input channels. These arguments are also used to compute the input data addresses during the IP's execution.
• Output: feature map data of size (Hout, Wout, Cout). The output feature map size can be specified by the user according to the input image size or intermediate results. Hout and Wout represent the height and width of the output data, respectively, and Cout represents the number of output channels. These arguments are also used to compute the output data addresses during the IP's execution.
• Weights: weight data of size (K, K, Cin, Cout) as well as bias data of size (Cout). The data are expected to be stored as flattened arrays in the off-chip memory (DDR).

2.1.4 Parameter Configuration

Configurable Run-time Parameters: The IP can execute different convolution tasks under different run-time arguments to achieve application flexibility. These arguments include:

1. Input feature map dimensions (Hin, Win, Cin). The output feature map dimensions are determined by the IP accordingly.
2. Kernel size K and kernel stride S. The convolution kernel size and stride are passed to the IP as arguments K and S. These arguments are also used to compute the weight data addresses during the IP's execution.
3. Weight data precision Wdata. The IP accepts 8 or 6 bits as weight precision options.

Configurable Hardware Performance Parameters: The IP can be configured with different block sizes and data precision options to achieve the best efficiency on different platforms. These parameters include:

1. Computation parallel factors Din and Dout. The computation parallel factors decide how many multiply and add operations are performed each cycle in the computation module. The larger Din and Dout are, the faster the computation and the shorter the IP latency, but the more resources (mainly DSPs and LUTs) are occupied. Currently, Din and Dout can only be set to 8, 16 or 32.
2. Input/output buffer sizes IBUFFSIZE and OBUFFSIZE. These two parameters decide the sizes of the input/output ping-pong buffers. IPs with larger buffers can keep more input/output data on chip and thus reduce data communication overhead, but they also occupy more BRAM resources.

2.1.5 Resource and Performance

We use the Xilinx Zynq ZCU102 Evaluation Kit as the hardware platform to verify the IP. Table 1 lists the IP performance for the convolution layers in AlexNet with a hardware configuration of IBUFFSIZE = 8192, OBUFFSIZE = 2048, Din = 16 and Dout = 32.

Table 1: Performance Result in AlexNet

Layer   Latency (6 bit)   Latency (8 bit)   Input Size (H, W, C)   Output Size (H, W, C)   Kernel Config (K, S)
Conv1   0.789 ms          0.871 ms          (224,224,3)            (55,55,96)              (11,4)
Conv2   1.060 ms          1.824 ms          (55,55,96)             (55,55,256)             (5,1)
Conv3   0.660 ms          1.049 ms          (27,27,128)            (13,13,192)             (3,1)
Conv4   0.699 ms          1.045 ms          (13,13,192)            (13,13,192)             (3,1)
Conv5   0.555 ms          0.789 ms          (13,13,192)            (13,13,128)             (3,1)

We also plan to implement and verify a streaming interface for this IP as our next step, so that inter-layer streaming becomes possible under proper IP integration and IP task assignment.

2.2 Depth-wise Separable Convolution IPs

2.2.1 Introduction

The depth-wise Conv K × K IP performs depthwise separable convolution, which was first proposed in [1] and subsequently used in Inception models [2] to reduce the amount of computation.
Unlike standard convolution, it performs a spatial convolution independently over each channel of the input, followed by a point-wise convolution, i.e. a 1 × 1 convolution (introduced in Section 2.3), which projects the channels output by the depthwise convolution onto a new channel space. One of its successful applications is MobileNet [4], which achieves higher classification accuracy on ImageNet with up to 60× parameter reduction compared to the most popular models such as GoogleNet [5] and VGG [6]. Given the promising performance of depthwise convolution, we provide the depth-wise Conv K × K open-source IP to carry out this computation. The following subsections describe its inputs, outputs, configurable parameters, implementation block diagram and performance in detail.

2.2.2 Interface

The communication between the Programmable Logic (PL) and the Processing System (PS) uses the memory-mapped AXI4 bus protocol. The bus width is 512 bits.

2.2.3 Inputs and Outputs

• Inputs: The IP takes feature map data of size (Hin, Win, Cin) as inputs.
• Outputs: The outputs are feature map data of size (Hout, Wout, Cout), where Cout = Cin. The data should be stored as three-dimensional arrays in the on-chip memory (BRAM) to achieve the best computational performance.
• Weights: The IP consumes weights of size (K, K, Cin). Given the nature of depth-wise convolution, the weights do not need a Cout dimension.

2.2.4 Parameter Configuration

Configurable Run-time Parameters: The IP can execute different convolution tasks under different run-time arguments to achieve application flexibility. These arguments include:

1. Input feature map dimensions (Hin, Win, Cin). The input feature map size can be specified by the user according to the input image size or intermediate results; the output feature map dimensions are determined by the IP itself accordingly.
2. Kernel size K. The convolution kernel size can be configured by changing parameter K.
Generally, the most commonly used kernel sizes are 3, 5 and 7.
3. Stride S. The stride of the convolution can be specified by changing parameter S. Usually the stride is 1 or 2. When S = 1, the output data dimensions are the same as the input, i.e. Hout = Hin, Wout = Win; when S = 2, the output height and width shrink by a factor of 2, i.e. Hout = Hin/2, Wout = Win/2.

Configurable Hardware Performance Parameters: The IP can be configured with different block sizes and data precision options to achieve the best efficiency on different platforms. These parameters include:

1. Parallel degree P along the C dimension. The parallel degree indicates how many multiplication operations can be executed within the same clock cycle. Given a fixed input size, the larger P is, the faster the computation and the shorter the IP latency, but the more resources (mainly DSPs and LUTs) are occupied. The user can specify the parallel degree according to the available FPGA resources. Note that the parallel degree P must be a divisor of Cin.
2. Data precision of the input/output feature maps and weights. Users can specify the data precision to be either floating point or fixed point. Fixed point is specified in the format <I, F>, where I represents the number of integer bits and F represents the number of fractional bits.

2.2.5 Resource and Performance

Figure 1 shows an example diagram of a depth-wise conv 3 × 3 with stride 1 to illustrate our IP architecture. In this example, the input data dimensions are (40, 20, 16), the output dimensions are (40, 20, 16), and the parallel degree is P = 16. As the figure shows, there are 16 computational units performing multiplication in parallel. Table 2 provides IP performance and resource usage data under different configurations on the Pynq-Z1 FPGA board. In this table, the feature maps use 8-bit fixed point with 2 integer bits, and the weights use 10-bit fixed point with 1 integer bit.
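The <I, F> fixed-point format described in Section 2.2.4 can be illustrated with a small software model. The sketch below is not taken from the IP source code. It assumes a signed two's-complement representation in which the sign bit is counted among the I integer bits, and it assumes that out-of-range values saturate; the helper names fixed_range and quantize are hypothetical.

```python
def fixed_range(i_bits, f_bits):
    """Return (min, max, step) of a signed <I, F> fixed-point format.

    Assumption: two's complement, with the sign bit counted among the
    I integer bits.
    """
    step = 2.0 ** -f_bits
    lo = -(2.0 ** (i_bits - 1))
    hi = 2.0 ** (i_bits - 1) - step
    return lo, hi, step


def quantize(x, i_bits, f_bits):
    """Round x to the nearest <I, F> value, saturating at the range limits."""
    lo, hi, step = fixed_range(i_bits, f_bits)
    q = round(x / step) * step
    return min(max(q, lo), hi)


# Feature maps in Table 2: 8 bits in total with 2 integer bits -> <2, 6>
feature = quantize(0.7, 2, 6)
# Weights in Table 2: 10 bits in total with 1 integer bit -> <1, 9>
weight = quantize(-0.3, 1, 9)
print(feature, weight)
```

Under these assumptions, a <2, 6> feature map value has a resolution of 1/64 and a <1, 9> weight has a resolution of 1/512, which shows how the choice of I and F trades range against precision.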
The parallel factors are set to 4, 8 and 16, respectively. The latency is given as the number of clock cycles under each parallel configuration. As shown in the table, the larger the parallel factor is, the shorter the latency and the more resources are occupied.

Figure 1: Depth-wise 3 × 3 convolutional IP design (16 input channels, each multiplied by its own weight and accumulated into the corresponding one of 16 output channels)

Table 2: Performance of Depth-Wise 3 × 3 IP on Pynq-Z1 Board [3]

IP            Parallel Factor   Latency (# of cycles)   LUT           DSP          Flip-Flop
DW-Conv 3x3   4                 53206                   1866 (1.4%)   16 (7.3%)    722 (0.7%)
DW-Conv 3x3   8                 38807                   2177 (1.2%)   16 (7.3%)    1549 (1.5%)
DW-Conv 3x3   16                18117                   4394 (8.3%)   36 (16.4%)   2027 (2.0%)
DW-Conv 5x5   4                 120075                  2001 (3.8%)   16 (7.3%)    738 (0.7%)
DW-Conv 5x5   8                 64007                   2668 (5.0%)   16 (7.3%)    554 (0.5%)
DW-Conv 5x5   16                30996                   4966 (9.3%)   36 (16.4%)   1045 (1.0%)

2.3 Point-wise Convolution 1 × 1 IP

The point-wise convolution 1 × 1 IP is usually used after a depthwise separable convolution to combine the output channels, as described in Section 2.2. It can be regarded as a special case of the standard convolution discussed in Section 2.1, so we omit a detailed description here. As for the depthwise conv IP, Table 3 shows its performance and resource usage on the Pynq-Z1 board.

Table 3: Performance of Point-Wise 1 × 1 IP on Pynq-Z1 Board [3]

IP         Parallel Factor   Latency (# of clks)   LUT             DSP           Flip-Flop
Conv 1x1   4                 50012                 3318 (6.2%)     48 (21.8%)    4517 (4.6%)
Conv 1x1   8                 29875                 5076 (9.5%)     64 (29.1%)    4920 (4.6%)
Conv 1x1   16                14378                 11871 (22.3%)   130 (59.1%)   10580 (9.9%)

2.4 Down-sampling (pooling) IP

Down-sampling, also called pooling, is another very common component in most deep neural networks. Pooling reduces the spatial dimensions, which helps improve computation performance, avoid over-fitting and improve translation invariance. The interface protocols are the same as for the IPs described above.

• Input and Output: The inputs and outputs of the pooling IP are similar to those of the depth-wise conv IP, but no weights are required. The input/output data should also be stored in the on-chip memory.
• Configurable Parameters: The parameters for the Pooling K × K IP include:
  1. Pooling size K, which indicates by how much the input is down-sampled along its spatial dimensions. The most common choices are K = 2 and K = 3. When K = 2, the x and y dimensions of the input data are down-sampled by a factor of 2, and when K = 3, the x and y dimensions are down-sampled by a factor of 3.
  2. Pooling method. We support the three most commonly used pooling methods: max pooling, average pooling and sum pooling.
  3. Input feature map dimensions (Hin, Win, Cin). As with the depthwise conv IP, the output data dimensions are decided by the pooling size.
  4. Parallel degree P along the C dimension and data precision of the input/output feature maps. These parameters are similar to those of the depthwise conv IP.
• Resource and Performance: Table 4 provides performance data for the pooling IP. The configuration is K = 2, S = 1, using the max pooling method.

Table 4: Performance of Max Pooling 2 × 2 IP on Pynq-Z1 Board [3]

IP            Parallel Factor   Latency (# of clks)   LUT           DSP        Flip-Flop
Pooling 2x2   4                 2805                  1037 (2.0%)   4 (1.8%)   825 (0.8%)
Pooling 2x2   8                 1411                  895 (1.7%)    4 (1.8%)   758 (0.7%)
Pooling 2x2   16                815                   807 (1.5%)    4 (1.8%)   739 (0.7%)

2.5 Bounding Box Regression

Most of the IPs we provide are convolution and pooling IPs, which are mostly used for feature extraction in image classification. To support more types of deep neural networks for different applications, we provide an IP for the object detection task.
Unlike image classification, object detection requires the neural network to draw a bounding box around the detected object. This is usually done by a bounding box regression component after the convolutional layers. For this purpose, we borrow the regression algorithm from the popular YOLO [7] and implement it as a configurable IP on FPGA. The input of this IP is the feature map of the last convolution layer, and the output is the coordinates of the detected bounding boxes. The configurable parameters of this IP include: 1) the input feature map dimensions; 2) the intermediate data precision during regression; and 3) the number of anchor boxes and their aspect ratios, as described in [7]. This gives users the flexibility to adapt the IP to the features of the objects to be detected.

2.6 Long-term Recurrent Convolution Network IP

2.6.1 Introduction

Apart from general-purpose DNN component IPs, we also developed an image content recognition IP based on the Long-term Recurrent Convolution Network (LRCN). The IP takes an image as input and generates a descriptive sentence as output. The overall network flow is shown in Figure 2. The input image is first processed by the CNN module for feature extraction. The extracted feature vector is then fed into the RNN module for recurrent word generation. The LRCN computation flow is implemented and packed into a single IP with the structure shown in Figure 3. The IP has two memory-mapped AXI interfaces for input/output data transfer and one block control interface for operation control. The input interface module is responsible for reading in the input data (including image data and neural network parameters) and streaming the data into the CNN component and the LSTM component in the requested order. The output interface module is responsible for writing data back to the off-chip memory.

2.6.2 Interface

The input and output interfaces between the IP and DDR use the memory-mapped AXI4 bus protocol. The bus width is 512 bits.
The control signals use an AXI-Lite GPIO register interface.

Figure 2: LRCN Network Flow

Figure 3: LRCN IP Structure

• Input and Output: The overall LRCN IP accepts an image and rearranged weight data as input and generates a word index sequence as output.
• Configurable Components: The CNN component and the RNN component are composed of configurable convolution and fully connected modules. Users may alter these modules for different CNN or LSTM structures.
  1. Convolution module. Similar to the standard convolution IPs described in Section 2.1, its configurable parameters include: input dimensions IH × IH × ID, output dimensions OH × OH × OD, kernel dimensions FW × FH, data precision (8 bit, 12 bit or 16 bit) and input/output parallel factor (8 or 16). The difference is that the convolution module in the LRCN IP uses a stream interface to receive weight data to achieve the best performance.
  2. Fully connected modules. The configurable parameters include: input vector length ID, output vector length OD, data precision (8 bit, 12 bit or 16 bit) and input/output parallel factor (8 or 16).
  3. LSTM (long short-term memory) module. The LSTM module generates the predicted output and stores the intermediate data in BRAM. In the next execution, it takes the stored intermediate data in BRAM as part of its inputs and generates the next output. The intermediate data are stored in streaming form to achieve the lowest latency.

2.6.3 Resource and Performance

We collected the resource and performance data of the image content recognition IP with AlexNet as the CNN component and LSTM as the RNN component. We used the Xilinx Virtex-7 VC709 evaluation platform with an XC7VX690T FPGA for the LRCN IP evaluation, and used PCIe for host-chip data transmission. The LRCN IP resource consumption and performance are shown in Table 5 and Table 6.
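The Speedup and Efficiency columns of Table 6 appear to be derived from the Latency and Power columns. The short check below reproduces them under two assumptions that the report does not state explicitly: that speedup is measured relative to the Intel Xeon latency, and that efficiency is energy per picture, i.e. power multiplied by latency. The function name derived_columns is hypothetical.

```python
# Rows of Table 6: (platform, latency in ms, power in W)
rows = [
    ("Our LRCN", 40.0, 23.6),
    ("NVidia K80", 124.0, 133.0),
    ("Intel Xeon", 190.0, 88.0),
]


def derived_columns(rows, baseline="Intel Xeon"):
    """Return {platform: (speedup over baseline, energy per picture in J)}."""
    base_latency = next(lat for name, lat, _ in rows if name == baseline)
    out = {}
    for name, latency_ms, power_w in rows:
        speedup = base_latency / latency_ms
        energy_j = power_w * latency_ms / 1000.0  # W * s = J
        out[name] = (speedup, energy_j)
    return out


for name, (speedup, energy) in derived_columns(rows).items():
    print(f"{name}: {speedup:.2f}X, {energy:.2f} J/pic")
```

With these assumptions the computed values round to the figures listed in Table 6 (for example 190 ms / 40 ms = 4.75X and 23.6 W × 0.040 s ≈ 0.94 J/pic).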
Table 5: Resource Consumption of LRCN

BRAM   DSP    Flip-Flop   LUT
1508   3130   321195      316250

Table 6: LRCN performance on FPGA (Virtex-7 VC709) with comparisons to CPU and GPU

Platform     Frequency   Latency   Speedup   Power    Efficiency
Our LRCN     100 MHz     40 ms     4.75X     23.6 W   0.94 J/pic
NVidia K80   562 MHz     124 ms    1.53X     133 W    16.49 J/pic
Intel Xeon   2.6 GHz     190 ms    1.00X     88 W     16.72 J/pic

3 Open-Source FPGA Accelerators for Machine Learning Applications

Aside from the open-source IPs for machine learning described in Section 2, in this section we present two open-source FPGA accelerators for machine learning applications: spam filtering and a binarized neural network. These two open-source designs are implemented in C++, leveraging the Xilinx SDx design suite for high-level synthesis, logic synthesis, place & route, and bitstream generation. The designs are currently collected in the Rosetta benchmark suite [8] developed by Prof. Zhang's group at Cornell. As a recent benchmark suite for software-programmable FPGAs, Rosetta contains fully developed, complex applications that are representative of realistic academic and industry accelerator designs. The benchmarks in Rosetta have been tested on a cloud FPGA platform (AWS F1 with Xilinx VU9P FPGA) and an embedded FPGA platform (Xilinx ZC706). Since the Xilinx toolflow on AWS is being continuously updated, we are also working on porting the Rosetta designs to the latest AWS flow.

3.1 Spam Filtering

The spam filtering application uses stochastic gradient descent (SGD) to train a logistic regression model for spam email classification. Unlike many FPGA accelerators that target the inference phase of machine learning models, the spam filtering accelerator aims for high performance in the training phase. In our current implementation, each email is represented by a 1024-dimensional vector, so the weight vector is also 1024-dimensional.
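The training computation described above can be sketched in software as follows. This is a minimal floating-point reference model of SGD for logistic regression, not the accelerator's fixed-point implementation; the learning rate, the number of epochs and the function names are assumptions made for illustration.

```python
import math

NUM_FEATURES = 1024  # each email is a 1024-dimensional vector


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_step(weights, features, label, lr):
    """One SGD update of logistic regression on a single sample."""
    z = sum(w * x for w, x in zip(weights, features))  # dot product
    error = sigmoid(z) - label  # derivative of the log loss w.r.t. z
    return [w - lr * error * x for w, x in zip(weights, features)]


def train(samples, labels, lr=0.1, epochs=5):
    """Train a 1024-dimensional weight vector, one sample at a time."""
    weights = [0.0] * NUM_FEATURES
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            weights = sgd_step(weights, x, y, lr)
    return weights
```

The dot product and the element-wise weight update are the two loops over the 1024 features; in this sketch they are plain Python loops.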
Since the compute kernels in this application are highly parallel, parallelization techniques such as loop unrolling, loop pipelining and dataflow optimization are applied to improve performance. Our implementation features datatype customization, where the features, weights and intermediate results are represented using hardware-friendly fixed-point types. The sigmoid activation function is implemented using a look-up table to avoid exponent and division operations. Users can adjust the bitwidths of the feature vector and the weight vector, as well as the parallelization factor of the compute kernels. The performance and resource utilization of the spam filtering accelerator on two Xilinx FPGA platforms are summarized in Table 7.

Table 7: Performance and Resource Utilization of Spam Filtering

Device         BRAM   DSP   Flip-Flop   LUT     Throughput
Xilinx ZC706   69     224   22134       12678   370k samples/s
Xilinx VU9P    90     224   17434       7207    1.6G samples/s

3.2 Binarized Neural Network

One challenge of implementing efficient neural network accelerators on FPGAs is that floating-point operations are very expensive even on modern FPGA devices. As a result, quantization techniques are often applied in modern FPGA neural network accelerators, where the features and weights are quantized to fixed-point datatypes with fewer bits. The binarized neural network (BNN) [9] is an extreme case of quantization, where both the weights and the features are represented using only one bit. For BNNs, the MAC operations of normal neural networks are replaced by XNOR and popcount operations, which can be mapped efficiently to the LUT-rich FPGA architecture. Our binarized neural network accelerator is adapted from [10], where the accelerator targets the inference phase of the BNN model proposed in [9] and works on CIFAR-10 images. There are two major compute kernels in the BNN benchmark: binarized convolution for the convolutional layers, and binarized dot product for the fully-connected layers.
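The replacement of MAC operations by XNOR and popcount can be illustrated with a short software model of the binarized dot product. The sketch assumes the common encoding in which bit 1 stands for +1 and bit 0 stands for -1, with vectors packed into Python integers. The function names are hypothetical and the code is not taken from the Rosetta sources.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer (bit i = 1 encodes +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binarized_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors stored as packed bits.

    XNOR marks the positions where the operands agree (product +1). The
    popcount of that mask is the number of agreements, so the result is
    agreements - disagreements = 2 * popcount - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    return 2 * bin(xnor).count("1") - n


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))  # ordinary multiply-accumulate
result = binarized_dot(pack_bits(a), pack_bits(b), len(a))
print(reference, result)
```

Both computations give the same value for this example, which shows why no multipliers are needed once the operands are restricted to one bit.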
To achieve high performance, our BNN implementation features intensive memory optimization, where a specialized line buffer is designed to maximize data reuse within the feature maps. The design is also parameterizable in that different numbers of convolutional units can be instantiated to trade off performance against resource utilization. The performance and resource utilization of the BNN accelerator on the Xilinx ZC706 are summarized in Table 8.

Table 8: Performance and Resource Utilization of BNN

BRAM   DSP   Flip-Flop   LUT     Throughput
102    4     46760       46899   200 images/s

References

[1] L. Sifre. Rigid-motion scattering for image classification. PhD thesis, 2014.
[2] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[3] https://reference.digilentinc.com/reference/programmable-logic/pynq-z1/start
[4] Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[6] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[7] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[8] Zhou, Yuan, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Srivastava, Hanchen Jin, Joseph Featherston et al. "Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs." In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 269-278. ACM, 2018.
[9] Courbariaux, Matthieu, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
[10] Zhao, Ritchie, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani Srivastava, Rajesh Gupta, and Zhiru Zhang. "Accelerating binarized convolutional neural networks with software-programmable FPGAs." In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 15-24. ACM, 2017.