In September 2016, we released DeepBench, an open source benchmarking tool that measures the performance of basic operations involved in training deep neural networks. These operations are executed on different hardware platforms using neural network libraries. DeepBench is available as a repository on GitHub.
The primary purpose of DeepBench is to benchmark operations that are important to deep learning on different hardware platforms. DeepBench uses neural network libraries to benchmark the performance of basic operations. It does not measure the performance of deep learning frameworks or the time to train an entire model. The figure below describes how DeepBench fits into the deep learning ecosystem.
We built DeepBench with two goals in mind. The first was to answer the question, “Which hardware provides the best performance on the basic operations used for training deep neural networks?” By measuring the performance of deep learning operations, DeepBench helps determine which hardware works best for training. In addition, DeepBench is also a platform for communicating deep learning requirements to hardware manufacturers.
DeepBench included five basic building blocks of deep learning training: matrix multiply, convolutions, recurrent operations (vanilla and Long Short Term Memory (LSTM)), and all-reduce. We benchmarked both the forward and backward pass for these operations and focused on training performance with single precision floating point arithmetic. More details regarding the methodology, operations, and hardware can be found on the DeepBench blog.
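As a concrete illustration of what “the forward and backward pass” of one of these building blocks means, here is a minimal numpy sketch for a GEMM: the backward pass of a matrix multiply is itself two more matrix multiplies, one for each input gradient. The sizes are illustrative and the snippet is only a sketch of the arithmetic, not the benchmark code.

```python
import numpy as np

# Forward and backward pass of a GEMM, following the convention
# C (MxN) = A (MxK) @ B (KxN).
M, N, K = 1760, 128, 1760
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)

# Forward pass.
C = A @ B

# Backward pass: given the gradient of the loss w.r.t. C, the gradients
# w.r.t. both inputs are themselves matrix multiplies.
dC = np.random.randn(M, N).astype(np.float32)
dA = dC @ B.T     # gradient w.r.t. A, an MxK matrix
dB = A.T @ dC     # gradient w.r.t. B, a KxN matrix
```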
We've received a lot of positive feedback on DeepBench from both industry and academia. Since the first release, we've added results for three more Nvidia GPUs, namely the TitanXp, 1080 Ti, and P100. DeepBench has become a standard for measuring deep learning performance on new hardware developed for training.
We've also received requests for supporting new features in DeepBench. Deep learning is a very fast moving field with progress being made in a wide variety of applications. Today, we'd like to share an updated version of DeepBench with a focus on inference performance.
The single most important feature request we received over the last year was to extend DeepBench to benchmark inference kernels. Benchmarking inference is a challenging problem: many applications have been enabled by deep learning, and each of them has its own performance characteristics. In order to collect kernels for benchmarking, we surveyed various applications across Baidu and selected applications that receive high user traffic. In addition, we are also including kernels from deep learning models that are commonly used across different applications. For example, Deep Residual Networks [5] are commonly used for several vision applications, and we've included the ResNet architecture in our inference kernels.
Deep learning inference also places interesting constraints on the models themselves. For example, real world applications using speech recognition typically require real time transcription with low latency. As mentioned in Deep Speech 2 [2], a bidirectional recurrent model isn't suitable for speech recognition applications with real time constraints. Instead, we had to deploy a forward-only recurrent model that satisfied real time latency constraints. The inference kernels in DeepBench are selected from deep learning models that are already deployed and satisfy the latency and throughput constraints of real world applications.
For the inference kernels, we cover the same set of operations as the training kernels, i.e. matrix multiply, convolution, and recurrent operations. However, the inference kernels differ from their training counterparts in a few ways. In the next few sections, we discuss the changes needed to benchmark inference workloads.
Large scale real world applications such as image search, language translation, and speech recognition are typically deployed on servers located in data centers. The client sends a request over the internet, which is processed on the remote server hosting the deep learning model. The remote server is typically a powerful machine with many processors, and its memory and compute capabilities are large enough to host very large deep learning models. The downside of deploying the model on a server is that latency depends on the network bandwidth between the client and the server, and the user must be connected to the internet. In order to address these issues, several models are being deployed on mobile devices [1, 11, 12]. On-device deployment gives deep learning models lower latency and keeps them available regardless of internet connectivity. However, these models need to be smaller in order to fit within the power and memory constraints of mobile and embedded devices.
In DeepBench, we measure the performance of inference kernels on both server and edge platforms. For server platforms, we benchmark three Nvidia GPUs: TitanX Pascal, TitanXp, and 1080 Ti. The mobile hardware market is very fragmented, with many different hardware systems used in a wide variety of mobile and embedded devices. DeepBench includes results on three ARM systems, namely the Raspberry Pi 3, iPhone 6, and iPhone 7. Hardware vendors or independent users can run the appropriate benchmarks and add their results to the repository. We provide an overview of the results below; detailed results are available in the DeepBench repository. We will gladly accept pull requests for new hardware platforms.
In order to meet latency requirements, most internet applications process user requests individually as they arrive at the data center. This makes for a straightforward implementation where each request is handled by a single thread. However, it is inefficient for two reasons. First, processing requests individually makes the operation bandwidth bound, since the processor needs to load the weights of the network for every request; it also makes it harder for the processor to effectively utilize the on-chip caches. Second, the amount of parallelism that can be exploited to classify one request is limited, making it difficult to exploit SIMD or multicore parallelism. RNNs are especially challenging to deploy because evaluating RNNs sample by sample relies on matrix-vector multiplications, which are bandwidth bound and difficult to parallelize.
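A rough back-of-the-envelope calculation makes the bandwidth problem concrete; the layer width here is illustrative rather than taken from any particular deployed model.

```python
# Why batch-1 RNN inference is bandwidth bound: a matrix-vector multiply
# performs only about 2 FLOPs for every 4-byte weight it reads from memory.
hidden = 2560                          # illustrative recurrent layer width
weight_bytes = hidden * hidden * 4     # fp32 recurrent weight matrix
flops = 2 * hidden * hidden            # one multiply-add per weight
print(flops / weight_bytes)            # ~0.5 FLOPs per byte loaded, far below the
                                       # compute-to-bandwidth ratio of modern processors
```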
To overcome these issues, we built a batching scheduler called Batch Dispatch [2] that assembles streams of data from user requests into batches before performing forward propagation on these batches. In this case, there is a tradeoff between increased batch size, and consequently improved efficiency, and increased latency. The more we buffer user requests to assemble a large batch, the longer users must wait for their results. This places constraints on the amount of batching we can perform.
In practice, we've seen that batching up to 4 or 5 requests works well for both efficiency and latency in data center deployment. For deployment on devices, the batch size is limited to 1.
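The sketch below illustrates the batching idea in a few lines of Python. It is not the actual Batch Dispatch implementation; the queue, batch size limit, and latency budget are placeholder names chosen for the example.

```python
import queue
import time

MAX_BATCH = 4        # batch sizes of around 4-5 worked well in our deployments
MAX_WAIT_S = 0.01    # hypothetical latency budget for assembling a batch

requests = queue.Queue()   # request handlers push work items onto this queue

def dispatch_loop(run_batched_forward):
    """Buffer requests until the batch is full or the budget expires,
    then run one batched forward pass."""
    while True:
        batch = [requests.get()]                  # block for the first request
        deadline = time.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_batched_forward(batch)                # one forward pass over the whole batch

# In a server, this loop would run on a background thread, e.g.
# threading.Thread(target=dispatch_loop, args=(model_forward,), daemon=True).start()
```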
Deep neural networks are trained using single precision 32 bit floating point numbers; we discuss the precision requirements for training in section 3.2. The precision requirements for inference are significantly lower than for training. Several different models can be deployed with 8 bit representations for inference with little or no loss in accuracy compared to their floating point counterparts [4, 9, 15]. Therefore, for the inference kernels, we specify minimum precisions of 8 bits for multiplication and 32 bits for accumulation. Not all hardware platforms support these minimum precision requirements; we will accept results from any platform with precision support larger than the specified minimum. All results will include the precision used for the benchmark.
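To make the 8-bit-multiply / 32-bit-accumulate arithmetic concrete, here is a minimal numpy emulation. The symmetric per-tensor scaling is purely illustrative and is not the quantization scheme of any particular library or deployed model.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float tensor to int8."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

A = np.random.randn(64, 256).astype(np.float32)
B = np.random.randn(256, 32).astype(np.float32)

qA, sA = quantize_int8(A)
qB, sB = quantize_int8(B)

# Products of int8 values are accumulated in int32 to avoid overflow,
# then rescaled back to floating point.
C_int32 = qA.astype(np.int32) @ qB.astype(np.int32)
C = C_int32.astype(np.float32) * (sA * sB)

print(np.abs(C - A @ B).max())   # quantization error relative to the fp32 GEMM
```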
To benchmark matrix multiplication with 8 bit inputs on ARM processors, we use the Gemmlowp library. Convolution kernels from the ARM Compute Library are used for the convolution benchmarks. The ARM Compute Library currently only supports single precision convolutions; low precision convolution support should be available shortly. The ARM Compute Library doesn't have any support for RNNs, so DeepBench doesn't include RNN results for ARM devices. We welcome contributions from other libraries that support RNN operations on ARM devices.
For server deployment, we use the cuDNN and cuBLAS libraries for Nvidia GPUs. On Nvidia GPUs, the RNN kernels only support single precision, so those results are reported in single precision. More details regarding which ops are supported on different processors can be found in section 4.
A sparse neural network is one where most of the weights of the network are zero. These zero weights don't contribute to the prediction of the network, and sparse neural networks reduce the memory and compute footprint, which enables deep learning models to be deployed on mobile devices. Inference performance of RNNs is dominated by the memory bandwidth of the hardware, since most of the work is simply reading in the parameters at every time step. Moving from a dense calculation to a sparse one comes with a penalty, but if the sparsity factor is large enough, then the smaller amount of data required by the sparse routines becomes a win.
The more powerful server class processors used in data centers can generally perform inference quickly enough to serve one user, but in the data center, performance per dollar is very important. Techniques such as sparsity that allow models to be evaluated faster enable more users to be served per GPU, increasing the effective performance per dollar.
There has been a lot of progress in developing sparse neural networks in the past couple of years [4, 13]. In DeepBench, we're including sparse matrix-vector and sparse matrix multiply kernels. Based on our research, we've learned that neural networks with 90-95% sparsity can achieve relatively good performance compared to their dense baselines. However, current implementations of sparse matrix multiply are optimized for much higher sparsity (around 99% or higher). By including sparse kernels, we're hoping to incentivize hardware vendors and software developers to build libraries that provide good performance for sparsity in the range of 90-95%.
We use the Eigen library to benchmark sparse operations on ARM devices. For GPU benchmarks, we use the cuSparse library from Nvidia.
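The snippet below illustrates the dense-versus-sparse tradeoff with scipy.sparse; it is only an illustration of the arithmetic, not the Eigen or cuSparse code used in the benchmarks. The matrix size and 95% sparsity match one of the sparse kernels listed in the results.

```python
import numpy as np
from scipy import sparse

# A 95%-sparse weight matrix applied to a batch-1 input vector.
M, K, density = 7680, 2560, 0.05

W_sparse = sparse.random(M, K, density=density, format='csr', dtype=np.float32)
W_dense = np.asarray(W_sparse.todense())
x = np.random.randn(K).astype(np.float32)

y_dense = W_dense @ x     # reads all M*K weights from memory
y_sparse = W_sparse @ x   # reads only the ~5% nonzero weights (plus their indices)

print(np.allclose(y_dense, y_sparse, atol=1e-4))
```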
Many inference applications have real-time latency requirements. For example, speech interfaces require speech recognition models to return a result without a delay that is noticeable to the user. DeepBench kernels can be used as a starting point to measure the best case latency of individual operations. However, measuring full system latency is outside the scope of this release of DeepBench, given its focus on basic operations rather than complete applications. For example, a complete application running on a mobile device might need to modify the power state of the system when starting up, and a complete server application might have a significant latency component determined by the user's network connection to the server. We may consider addressing operation latency in a future version of DeepBench.
In this version, we're also including an update to the training kernels. We've added kernels from new applications that we've seen here at Baidu, and we've seen a lot of new research in existing applications. In the next couple of sections, we discuss the two major additions to the training version of the benchmark.
In the first version of DeepBench, we included kernels for vanilla and LSTM [6] RNNs. Another commonly used recurrent architecture is the Gated Recurrent Unit (GRU) [3]. A recent comprehensive study showed that the GRU architecture is comparable to LSTMs with properly initialized gates [10]. GRUs have fewer parameters than LSTMs and should be more efficient than LSTMs for training and inference.
Several deep learning models use GRUs as the recurrent cell, and we've included GRU kernels in the DeepBench training and inference workloads. GRUs have very similar performance characteristics to LSTM units. Similar to vanilla RNNs and LSTMs, it is possible to optimize performance by combining several layers; in this benchmark, we focus on the performance of a single GRU layer. The internal nonlinearity of the GRU should be a sigmoid for the reset and update gates, and the nonlinearity for the output gate should be a ReLU.
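For reference, here is a minimal single-timestep GRU cell in numpy using those nonlinearities (sigmoid for the reset and update gates, ReLU for the candidate output). The weight layout and the exact point where the reset gate is applied vary between implementations, so treat this as one standard formulation rather than the precise cell used in the benchmark; the sizes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def gru_step(x, h, Wx, Wh, b, hidden):
    """One GRU timestep; weights are packed as [reset; update; candidate]."""
    gx = x @ Wx + b     # input contribution for all three gates
    gh = h @ Wh         # recurrent contribution for all three gates
    r = sigmoid(gx[:, :hidden]         + gh[:, :hidden])             # reset gate
    z = sigmoid(gx[:, hidden:2*hidden] + gh[:, hidden:2*hidden])     # update gate
    c = relu(gx[:, 2*hidden:]          + r * gh[:, 2*hidden:])       # candidate state (ReLU)
    return z * h + (1.0 - z) * c                                     # new hidden state

batch, input_dim, hidden = 4, 2560, 2560   # illustrative sizes
x = np.random.randn(batch, input_dim).astype(np.float32)
h = np.zeros((batch, hidden), dtype=np.float32)
Wx = (np.random.randn(input_dim, 3 * hidden) * 0.01).astype(np.float32)
Wh = (np.random.randn(hidden, 3 * hidden) * 0.01).astype(np.float32)
b = np.zeros(3 * hidden, dtype=np.float32)

h = gru_step(x, h, Wx, Wh, b, hidden)
```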
While training deep learning models, most researchers typically use single precision floating point numbers for all compute kernels. Academic research has demonstrated that reduced precision training works for several different models trained on limited datasets [7, 8, 14]. In our experience, we've found that 16 bit half precision floating point numbers are sufficient to reliably train large deep learning models on large datasets. Training with half precision numbers allows hardware vendors to better utilize the available computing power, and the weights require half the storage for the entire model.
With this update, we are specifying the precision for multiply and add for all the operations used in training. The minimum precisions for multiplication and addition are set to 16 and 32 bits respectively. None of the currently available hardware supports 16 bit multiply with 32 bit accumulate, but we will accept results on any hardware platform that satisfies this minimum precision requirement. All results will include the precision used for the benchmark.
In this section, we document the support for the various operations and precisions across different processors. Wherever possible, we pick the precision that most closely matches the minimum required precision; the precision requirements are restated below. However, there are cases where we need to benchmark higher precision operations. The tables below highlight which operations are benchmarked for each processor.
Minimum Precision for training: 16 bit multiply, 32 bit accumulate
Minimum Precision for inference: 8 bit multiply, 32 bit accumulate
Single precision results are available for 6 Nvidia GPUs and Intel's Xeon Phi processor. None of the available processors supports 16 bit multiplication with 32 bit addition. Instead, we benchmark Nvidia's pseudo FP16 mode, where inputs and outputs are 16 bit but the compute is still in single precision. Support for low precision training will be available in upcoming processors.
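A minimal numpy sketch of what pseudo FP16 mode means for a GEMM: operands are stored as 16-bit floats, but the multiply and accumulate are carried out in 32-bit floating point. The sizes are illustrative.

```python
import numpy as np

A16 = np.random.randn(1760, 1760).astype(np.float16)   # fp16 storage
B16 = np.random.randn(1760, 128).astype(np.float16)

C32 = A16.astype(np.float32) @ B16.astype(np.float32)   # fp32 multiply and accumulate
C16 = C32.astype(np.float16)                             # fp16 output
```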
Processor | Single precision | FP16 inputs/FP32 math |
---|---|---|
Nvidia TitanX Maxwell | GEMM, Conv, RNN | |
Nvidia Tesla M40 | GEMM, Conv, RNN | |
Nvidia 1080Ti | GEMM, Conv, RNN | |
Nvidia TitanX Pascal | GEMM, Conv, RNN | |
Nvidia TitanXp | GEMM, Conv, RNN | |
Nvidia Tesla P100 | GEMM, Conv, RNN | GEMM, Conv, RNN |
Intel Xeon Phi 7250 | GEMM, Conv | |
The GEMM and convolution benchmarks are run with 8 bit multiplication and 32 bit accumulation on Nvidia processors. However, Nvidia GPUs don't support all input sizes in this precision mode: input sizes have to be a multiple of 4. We have therefore padded input dimensions to be multiples of 4 for all kernels; the cost of padding and discarding the extra outputs is small compared to the cost of the operation. The results spreadsheet indicates which of the kernels required padding. Sparse and recurrent kernel results are reported in single precision, since the relevant libraries don't support low precision.
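A short sketch of that padding step, using one of the inference GEMM sizes below (M=35, N=700, K=2048, where only M needs padding). The int32 matmul here is just a stand-in for the actual int8 GEMM kernel.

```python
import numpy as np

def pad_to_multiple(x, multiple=4):
    """Zero-pad each dimension of x up to the next multiple of `multiple`."""
    pad = [(0, -dim % multiple) for dim in x.shape]
    return np.pad(x, pad)

# Illustrative int8 operands with M=35, K=2048, N=700; only M=35 needs padding (to 36).
A = np.random.randint(-127, 128, size=(35, 2048), dtype=np.int8)
B = np.random.randint(-127, 128, size=(2048, 700), dtype=np.int8)

Ap, Bp = pad_to_multiple(A), pad_to_multiple(B)
Cp = Ap.astype(np.int32) @ Bp.astype(np.int32)   # stand-in for the int8 GEMM kernel
C = Cp[:A.shape[0], :B.shape[1]]                 # discard the extra padded outputs
```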
Processor | Single Precision | Int8 multiply/32 bit accumulate |
---|---|---|
Nvidia 1080Ti | RNN, Sparse GEMM | GEMM, Conv |
Nvidia TitanX Pascal | RNN, Sparse GEMM | GEMM, Conv |
Nvidia TitanXp | RNN, Sparse GEMM | GEMM, Conv |
The table below describes the inference kernel results available for edge devices, broken down by processor, operation, and precision. We don't have any results for RNNs, since no ARM libraries support them. The ARM Compute Library is not yet supported on the iPhone.
Processor | Single Precision | Int8 inputs/32 bit math |
---|---|---|
Raspberry Pi 3 | Conv | GEMM, Sparse GEMM |
iPhone 6 | GEMM, Sparse GEMM | |
iPhone 7 | GEMM, Sparse GEMM | |
In this section, we document the performance of a few operations. These are picked at random and are only meant to demonstrate the performance for a few applications. The results below only include the time and FLOPS for the fastest processor for each operation and set of parameters. The full results can be found in the repository.
Results on more hardware platforms or different precision will be added once they are available. We welcome contributions from all hardware vendors and library developers.
Example results for GEMM, convolution, vanilla RNN, and LSTM training kernels were included in the first DeepBench release. The next section provides results for some example GRU kernels, and in section 5.1.2 we document the performance of pseudo half precision GEMM kernels.
The recurrent op kernels are only run on Nvidia hardware.
Hidden Units | Batch Size | TimeSteps | Application | Time (ms) | Fwd TeraFLOPS | Processor |
---|---|---|---|---|---|---|
2816 | 32 | 1500 | Speech Recognition | 1790.75 | 4.05 | TitanX Pascal |
2048 | 32 | 375 | Speech Recognition | 317.09 | 3.11 | 1080 Ti |
1024 | 64 | 1500 | Speaker ID | 357.04 | 3.06 | TitanXp |
In the results below, inputs and outputs are 16 bit but still use 32 bit compute.
For the GEMM kernels, M, N, and K represent the matrix sizes. The sizes of the two matrices are M×K and K×N.
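Assuming the conventional 2·M·N·K FLOP count for a GEMM, the TeraFLOPS column can be roughly reproduced from the kernel size and the reported (rounded) time; for example, for the first kernel below:

```python
# Rough reproduction of the TeraFLOPS column for the first kernel below,
# assuming the conventional 2*M*N*K FLOP count for a GEMM. The reported time
# is rounded, so this lands close to (not exactly on) the listed 3.42 TeraFLOPS.
M, N, K = 1760, 128, 1760
time_s = 0.23e-3
print(2 * M * N * K / time_s / 1e12)   # ~3.4 TeraFLOPS
```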
Kernel | Application | Results (ms) | TeraFLOPS | Processor |
---|---|---|---|---|
M=1760, N=128, K=1760 | Speech Recognition | 0.23 | 3.42 | P100 |
M=7860, N=64, K=2560 | Speech Recognition | 0.61 | 4.15 | P100 |
M=2560, N=64, K=2560 | Speech Recognition | 0.33 | 2.57 | P100 |
The next few sections provide a few results for GEMM, Convolution and Recurrent operations for inference kernels on server platforms. Results on Intel platforms should be available shortly.
Kernel | Application | Results (ms) | TeraFLOPS | Processor |
---|---|---|---|---|
M=5124, N=700, K=2048 | Speech Recognition | 0.46 | 31.94 | 1080 Ti |
M=35, N=700, K=2048 | Speech Recognition | 0.05 | 2.09 | 1080 Ti |
M=3072, N=3000, K=1024 | Speech Recognition | 0.49 | 38.36 | Titan Xp |
M=512, N=6000, K=2816 | Speech Recognition | 0.43 | 40.71 | Titan Xp |
Kernel | Sparsity | Application | Results (ms) | Speedup wrt dense | TeraFLOPS | Processor |
---|---|---|---|---|---|---|
M=7680, N=1, K=2560 | 0.95 | Speech Recognition | 0.03 | 6.56 | 1.10 | 1080 Ti |
M=7680, N=2, K=2560 | 0.95 | Speech Recognition | 0.04 | 5.93 | 1.74 | 1080 Ti |
M=7680, N=1500, K=2560 | 0.95 | Speech Recognition | 29.81 | 0.16 | 1.88 | TitanXp |
M=10752, N=1, K=3584 | 0.9 | Speech Recognition | 0.1 | 4 | 0.72 | TitanXp |
Input Size | Filter Size | # of Filters | Padding (h, w) | Stride (h, w) | Application | Time (ms) | TeraFLOPS | Processor |
---|---|---|---|---|---|---|---|---|
W = 341, H = 79, C = 32, N = 4 | R = 5, S = 10 | 32 | 0,0 | 2,2 | Speech Recognition | 0.29 | 8.45 | TitanXp |
W = 224, H = 224, C = 3, N = 1 | R = 7, S = 7 | 64 | 3, 3 | 2, 2 | Computer Vision | 0.14 | 1.64 | TitanXp |
W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1 | 128 | 0, 0 | 2, 2 | Computer Vision | 0.015 | 3.43 | TitanX Pascal |
W = 7, H = 7, C = 512, N = 2 | R = 1, S = 1 | 2048 | 0, 0 | 1, 1 | Computer Vision | 0.018 | 11.42 | 1080 Ti |
Hidden Units | Batch Size | TimeSteps | Recurrent Type | Application | Results (ms) | Fwd TeraFLOPS | Processor |
---|---|---|---|---|---|---|---|
1536 | 4 | 50 | LSTM | Language Modelling | 6.93 | 0.55 | TitanXp |
256 | 4 | 150 | LSTM | Character Language Modelling | 1.63 | 0.19 | 1080 Ti |
2816 | 1 | 1500 | GRU | Speech Recognition | 350.62 | 0.20 | TitanXp |
2560 | 2 | 375 | GRU | Speech Recognition | 75.02 | 0.39 | TitanXp |
Kernel | Application | Results (ms) | GigaFLOPS | Processor |
---|---|---|---|---|
M=5124, N=700, K=2048 | Speech Recognition | 212.84 | 69.03 | iPhone 7 |
M=35, N=700, K=2048 | Speech Recognition | 1.94 | 51.69 | iPhone 7 |
M=3072, N=1500, K=1024 | Speech Recognition | 136.63 | 69.07 | iPhone 7 |
Kernel | Sparsity | Application | Results (ms) | Speedup wrt dense | GigaFLOPS | Processor |
---|---|---|---|---|---|---|
M=7680, N=1, K=2560 | 0.95 | Speech Recognition | 1.01 | 15.55 | 18.55 | iPhone 7 |
M=7680, N=1500, K=2560 | 0.95 | Speech Recognition | 1677.36 | 5.46 | 16.70 | iPhone 7 |
M=7680, N=1, K=2560 | 0.9 | Speech Recognition | 2.1 | 8.02 | 8.41 | iPhone 7 |
Input Size | Filter Size | # of Filters | Padding (h, w) | Stride (h, w) | Application | Time (ms) | GigaFLOPS | Processor |
---|---|---|---|---|---|---|---|---|
W = 112, H = 112, C = 64, N = 1 | R = 1, S = 1 | 64 | 0, 0 | 1, 1 | Computer Vision | 670.75 | 0.15 | Raspberry Pi 3 |
W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1 | 128 | 0, 0 | 2, 2 | Computer Vision | 185.87 | 0.28 | Raspberry Pi 3 |
W = 7, H = 7, C = 512, N = 1 | R = 1, S = 1 | 2048 | 0, 0 | 1, 1 | Computer Vision | 735.28 | 0.14 | Raspberry Pi 3 |
We welcome contributions from the community to DeepBench. As described above, you can run the benchmarks on new hardware platforms and submit a pull request with your results, or contribute support for additional kernels and libraries.
We would like to thank Mohammad Shoeybi, Kexin Zhao, Markus Kliegl, Thuan Nguyen & Julia Li from Baidu Research for their contributions to DeepBench and our partners at Intel, Nvidia & ARM for their guidance.