In September 2016, we released DeepBench, an open source benchmarking tool that measures the performance of basic operations involved in training deep neural networks. These operations are executed on different hardware platforms using neural network libraries. DeepBench is available as a repository on GitHub.
The primary purpose of DeepBench is to benchmark operations that are important to deep learning on different hardware platforms. DeepBench uses neural network libraries to benchmark the performance of basic operations. It does not measure the performance of deep learning frameworks or the time to train an entire model. The figure below describes how DeepBench fits into the deep learning ecosystem.
We built DeepBench with two goals in mind. The first was to answer the question, “Which hardware provides the best performance on the basic operations used for training deep neural networks?” By measuring the performance of deep learning operations, DeepBench helps determine which hardware works best for training. In addition, DeepBench is also a platform for communicating deep learning requirements to hardware manufacturers.
DeepBench included five basic building blocks of deep learning training: matrix multiply, convolutions, recurrent operations (vanilla and Long Short Term Memory (LSTM)), and all-reduce. We benchmarked both the forward and backward pass for these operations and focused on training performance with single precision floating point arithmetic. More details regarding the methodology, operations, and hardware can be found on the DeepBench blog.
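As a concrete illustration of what “the forward and backward pass” of one of these building blocks means, here is a minimal numpy sketch for a GEMM: the backward pass of a matrix multiply is itself two more matrix multiplies, one for each input gradient. The sizes are illustrative and the snippet is only a sketch of the arithmetic, not the benchmark code.

```python
import numpy as np

# Forward and backward pass of a GEMM, following the convention
# C (MxN) = A (MxK) @ B (KxN).
M, N, K = 1760, 128, 1760
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)

# Forward pass.
C = A @ B

# Backward pass: given the gradient of the loss w.r.t. C, the gradients
# w.r.t. both inputs are themselves matrix multiplies.
dC = np.random.randn(M, N).astype(np.float32)
dA = dC @ B.T     # gradient w.r.t. A, an MxK matrix
dB = A.T @ dC     # gradient w.r.t. B, a KxN matrix
```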
We've received a lot of positive feedback on DeepBench from both industry and academia. Since the first release, we've added results for three more Nvidia GPUs, namely the TitanXp, 1080 Ti, and P100. DeepBench has become a standard for measuring deep learning performance on new hardware developed for training.
We've also received requests for supporting new features in DeepBench. Deep learning is a very fast moving field with progress being made in a wide variety of applications. Today, we'd like to share an updated version of DeepBench with a focus on inference performance.
The single most important feature request we received over the last year was to extend DeepBench to benchmark inference kernels. Benchmarking inference is a challenging problem: many applications have been enabled by deep learning, and each of them has its own performance characteristics. In order to collect kernels for benchmarking, we surveyed various applications across Baidu and selected applications that receive high user traffic. In addition, we are also including kernels from deep learning models that are commonly used across different applications. For example, Deep Residual Networks [5] are commonly used for several vision applications, and we've included the ResNet architecture in our inference kernels.
Deep learning inference also places interesting constraints on the models themselves. For example, real world applications using speech recognition typically require real time transcription with low latency. As mentioned in Deep Speech 2 [2], a bidirectional recurrent model isn't suitable for speech recognition applications with real time constraints. Instead, we had to deploy a forward-only recurrent model that satisfied real time latency constraints. The inference kernels in DeepBench are selected from deep learning models that are already deployed and satisfy the latency and throughput constraints of real world applications.
For the inference kernels, we cover the same set of operations as the training kernels, i.e. matrix multiply, convolution, and recurrent operations. However, the inference kernels differ from their training counterparts in a few ways. In the next few sections, we discuss the changes needed to benchmark inference workloads.
Large scale real world applications such as image search, language translation, and speech recognition are typically deployed on servers located in data centers. The client sends a request over the internet, which is processed on the remote server hosting the deep learning model. The remote server is typically a powerful machine with many processors, and its memory and compute capabilities are large enough to host very large deep learning models. The downside of deploying the model on a server is that latency depends on the network bandwidth between the client and the server, and the user must be connected to the internet. In order to address these issues, several models are being deployed on mobile devices [1, 11, 12]. On-device deployment gives deep learning models lower latency and keeps them available regardless of internet connectivity. However, these models need to be smaller in order to fit within the power and memory constraints of mobile and embedded devices.
In DeepBench, we measure the performance of inference kernels on both server and edge platforms. For server platforms, we benchmark three Nvidia GPUs: TitanX Pascal, TitanXp, and 1080 Ti. The mobile hardware market is very fragmented, with many different hardware systems used in a wide variety of mobile and embedded devices. DeepBench includes results on three ARM systems, namely the Raspberry Pi 3, iPhone 6, and iPhone 7. Hardware vendors or independent users can run the appropriate benchmarks and add their results to the repository. We provide an overview of the results below; detailed results are available in the DeepBench repository. We will gladly accept pull requests for new hardware platforms.
In order to meet latency requirements, most internet applications process user requests individually as they arrive at the data center. This makes for a straightforward implementation where each request is handled by a single thread. However, it is inefficient for two reasons. First, processing requests individually makes the operation bandwidth bound, since the processor needs to load the weights of the network for every request; it also makes it harder for the processor to effectively utilize the on-chip caches. Second, the amount of parallelism that can be exploited to classify one request is limited, making it difficult to exploit SIMD or multicore parallelism. RNNs are especially challenging to deploy because evaluating RNNs sample by sample relies on matrix-vector multiplications, which are bandwidth bound and difficult to parallelize.
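A rough back-of-the-envelope calculation makes the bandwidth problem concrete; the layer width here is illustrative rather than taken from any particular deployed model.

```python
# Why batch-1 RNN inference is bandwidth bound: a matrix-vector multiply
# performs only about 2 FLOPs for every 4-byte weight it reads from memory.
hidden = 2560                          # illustrative recurrent layer width
weight_bytes = hidden * hidden * 4     # fp32 recurrent weight matrix
flops = 2 * hidden * hidden            # one multiply-add per weight
print(flops / weight_bytes)            # ~0.5 FLOPs per byte loaded, far below the
                                       # compute-to-bandwidth ratio of modern processors
```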
To overcome these issues, we built a batching scheduler called Batch Dispatch [2] that assembles streams of data from user requests into batches before performing forward propagation on these batches. In this case, there is a tradeoff between increased batch size, and consequently improved efficiency, and increased latency. The more we buffer user requests to assemble a large batch, the longer users must wait for their results. This places constraints on the amount of batching we can perform.
In practice, we've seen that batching up to 4 or 5 requests works well for both efficiency and latency in data center deployment. For deployment on devices, the batch size is limited to 1.
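The sketch below illustrates the batching idea in a few lines of Python. It is not the actual Batch Dispatch implementation; the queue, batch size limit, and latency budget are placeholder names chosen for the example.

```python
import queue
import time

MAX_BATCH = 4        # batch sizes of around 4-5 worked well in our deployments
MAX_WAIT_S = 0.01    # hypothetical latency budget for assembling a batch

requests = queue.Queue()   # request handlers push work items onto this queue

def dispatch_loop(run_batched_forward):
    """Buffer requests until the batch is full or the budget expires,
    then run one batched forward pass."""
    while True:
        batch = [requests.get()]                  # block for the first request
        deadline = time.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_batched_forward(batch)                # one forward pass over the whole batch

# In a server, this loop would run on a background thread, e.g.
# threading.Thread(target=dispatch_loop, args=(model_forward,), daemon=True).start()
```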
Deep neural networks are trained using single precision 32 bit floating point numbers; we discuss the precision requirements for training in section 3.2. The precision requirements for inference are significantly lower than for training. Several different models can be deployed with 8 bit representations for inference with little or no loss in accuracy compared to their floating point counterparts [4, 9, 15]. Therefore, for the inference kernels, we specify minimum precisions of 8 bits for multiplication and 32 bits for accumulation. Not all hardware platforms support these minimum precision requirements; we will accept results from any platform with precision support larger than the specified minimum. All results will include the precision used for the benchmark.
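To make the 8-bit-multiply / 32-bit-accumulate arithmetic concrete, here is a minimal numpy emulation. The symmetric per-tensor scaling is purely illustrative and is not the quantization scheme of any particular library or deployed model.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float tensor to int8."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

A = np.random.randn(64, 256).astype(np.float32)
B = np.random.randn(256, 32).astype(np.float32)

qA, sA = quantize_int8(A)
qB, sB = quantize_int8(B)

# Products of int8 values are accumulated in int32 to avoid overflow,
# then rescaled back to floating point.
C_int32 = qA.astype(np.int32) @ qB.astype(np.int32)
C = C_int32.astype(np.float32) * (sA * sB)

print(np.abs(C - A @ B).max())   # quantization error relative to the fp32 GEMM
```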
To benchmark matrix multiplication with 8 bit inputs on ARM processors, we use the Gemmlowp library. Convolution kernels from the ARM Compute Library are used for the convolution benchmarks. The ARM Compute Library currently only supports single precision convolutions; low precision convolution support should be available shortly. The ARM Compute Library doesn't have any support for RNNs, so DeepBench doesn't include RNN results for ARM devices. We welcome contributions from other libraries that support RNN operations on ARM devices.
For server deployment, we use the cuDNN and cuBLAS libraries for Nvidia GPUs. On Nvidia GPUs, the RNN kernels only support single precision, so those results are reported in single precision. More details regarding which ops are supported on different processors can be found in section 4.
A sparse neural network is one where most of the weights of the network are zero. These zero weights don't contribute to the prediction of the network, and sparse neural networks reduce the memory and compute footprint, which enables deep learning models to be deployed on mobile devices. Inference performance of RNNs is dominated by the memory bandwidth of the hardware, since most of the work is simply reading in the parameters at every time step. Moving from a dense calculation to a sparse one comes with a penalty, but if the sparsity factor is large enough, then the smaller amount of data required by the sparse routines becomes a win.
The more powerful server class processors used in data centers can generally perform inference quickly enough to serve one user, but in the data center, performance per dollar is very important. Techniques such as sparsity that allow models to be evaluated faster enable more users to be served per GPU, increasing the effective performance per dollar.
There has been a lot of progress in developing sparse neural networks in the past couple of years [4, 13]. In DeepBench, we're including sparse matrix-vector and sparse matrix multiply kernels. Based on our research, we've learned that neural networks with 90-95% sparsity can achieve relatively good performance compared to their dense baselines. However, current implementations of sparse matrix multiply are optimized for much higher sparsity (around 99% or higher). By including sparse kernels, we're hoping to incentivize hardware vendors and software developers to build libraries that provide good performance for sparsity in the range of 90-95%.
We use the Eigen library to benchmark sparse operations on ARM devices. For GPU benchmarks, we use the cuSparse library from Nvidia.
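The snippet below illustrates the dense-versus-sparse tradeoff with scipy.sparse; it is only an illustration of the arithmetic, not the Eigen or cuSparse code used in the benchmarks. The matrix size and 95% sparsity match one of the sparse kernels listed in the results.

```python
import numpy as np
from scipy import sparse

# A 95%-sparse weight matrix applied to a batch-1 input vector.
M, K, density = 7680, 2560, 0.05

W_sparse = sparse.random(M, K, density=density, format='csr', dtype=np.float32)
W_dense = np.asarray(W_sparse.todense())
x = np.random.randn(K).astype(np.float32)

y_dense = W_dense @ x     # reads all M*K weights from memory
y_sparse = W_sparse @ x   # reads only the ~5% nonzero weights (plus their indices)

print(np.allclose(y_dense, y_sparse, atol=1e-4))
```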
Many inference applications have real-time latency requirements. For example, speech interfaces require speech recognition models to return a result without a delay that is noticeable to the user. DeepBench kernels can be used as a starting point to measure the best case latency of individual operations. However, measuring full system latency is outside the scope of this release of DeepBench, given its focus on basic operations rather than complete applications. For example, a complete application running on a mobile device might need to modify the power state of the system when starting up, and a complete server application might have a significant latency component determined by the user's network connection to the server. We may consider addressing operation latency in a future version of DeepBench.
In this version, we're also including an update to the training kernels. We've added kernels from new applications that we've seen here at Baidu, and we've seen a lot of new research in existing applications. In the next couple of sections, we discuss the two major additions to the training version of the benchmark.
In the first version of DeepBench, we included kernels for vanilla and LSTM [6] RNNs. Another commonly used recurrent architecture is the Gated Recurrent Unit (GRU) [3]. A recent comprehensive study showed that the GRU architecture is comparable to LSTMs with properly initialized gates [10]. GRUs have fewer parameters than LSTMs and should be more efficient than LSTMs for training and inference.
Several deep learning models use GRUs as the recurrent cell, and we've included GRU kernels in the DeepBench training and inference workloads. GRUs have very similar performance characteristics to LSTM units. Similar to vanilla RNNs and LSTMs, it is possible to optimize performance by combining several layers; in this benchmark, we focus on the performance of a single GRU layer. The internal nonlinearity of the GRU should be a sigmoid for the reset and update gates, and the nonlinearity for the output gate should be a ReLU.
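For reference, here is a minimal single-timestep GRU cell in numpy using those nonlinearities (sigmoid for the reset and update gates, ReLU for the candidate output). The weight layout and the exact point where the reset gate is applied vary between implementations, so treat this as one standard formulation rather than the precise cell used in the benchmark; the sizes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def gru_step(x, h, Wx, Wh, b, hidden):
    """One GRU timestep; weights are packed as [reset; update; candidate]."""
    gx = x @ Wx + b     # input contribution for all three gates
    gh = h @ Wh         # recurrent contribution for all three gates
    r = sigmoid(gx[:, :hidden]         + gh[:, :hidden])             # reset gate
    z = sigmoid(gx[:, hidden:2*hidden] + gh[:, hidden:2*hidden])     # update gate
    c = relu(gx[:, 2*hidden:]          + r * gh[:, 2*hidden:])       # candidate state (ReLU)
    return z * h + (1.0 - z) * c                                     # new hidden state

batch, input_dim, hidden = 4, 2560, 2560   # illustrative sizes
x = np.random.randn(batch, input_dim).astype(np.float32)
h = np.zeros((batch, hidden), dtype=np.float32)
Wx = (np.random.randn(input_dim, 3 * hidden) * 0.01).astype(np.float32)
Wh = (np.random.randn(hidden, 3 * hidden) * 0.01).astype(np.float32)
b = np.zeros(3 * hidden, dtype=np.float32)

h = gru_step(x, h, Wx, Wh, b, hidden)
```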
While training deep learning models, most researchers typically use single precision floating point numbers for all compute kernels. Academic research has demonstrated that reduced precision training works for several different models trained on limited datasets [7, 8, 14]. In our experience, we've found that 16 bit half precision floating point numbers are sufficient to reliably train large deep learning models on large datasets. Training with half precision numbers allows hardware vendors to better utilize the available computing power, and the weights require half the storage for the entire model.
With this update, we are specifying the precision for multiply and add for all the operations used in training. The minimum precisions for multiplication and addition are set to 16 and 32 bits respectively. None of the currently available hardware supports 16 bit multiply with 32 bit accumulate, but we will accept results on any hardware platform that satisfies this minimum precision requirement. All results will include the precision used for the benchmark.
In this section, we document the support for the various operations and precisions across different processors. Wherever possible, we pick the precision that most closely matches the minimum required precision; the precision requirements are restated below. However, there are cases where we need to benchmark higher precision operations. The tables below highlight which operations are benchmarked for each processor.
Minimum Precision for training: 16 bit multiply, 32 bit accumulate
Minimum Precision for inference: 8 bit multiply, 32 bit accumulate
Single precision results are available for 6 Nvidia GPUs and Intel's Xeon Phi processor. None of the available processors supports 16 bit multiplication with 32 bit addition. Instead, we benchmark Nvidia's pseudo FP16 mode, where inputs and outputs are 16 bit but the compute is still in single precision. Support for low precision training will be available in upcoming processors.
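A minimal numpy sketch of what pseudo FP16 mode means for a GEMM: operands are stored as 16-bit floats, but the multiply and accumulate are carried out in 32-bit floating point. The sizes are illustrative.

```python
import numpy as np

A16 = np.random.randn(1760, 1760).astype(np.float16)   # fp16 storage
B16 = np.random.randn(1760, 128).astype(np.float16)

C32 = A16.astype(np.float32) @ B16.astype(np.float32)   # fp32 multiply and accumulate
C16 = C32.astype(np.float16)                             # fp16 output
```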
Processor | Single precision | FP16 inputs/FP32 math |
---|---|---|
Nvidia TitanX Maxwell | GEMM, Conv, RNN | |
Nvidia Tesla M40 | GEMM, Conv, RNN | |
Nvidia 1080Ti | GEMM, Conv, RNN | |
Nvidia TitanX Pascal | GEMM, Conv, RNN | |
Nvidia TitanXp | GEMM, Conv, RNN | |
Nvidia Tesla P100 | GEMM, Conv, RNN | GEMM, Conv, RNN |
Intel Xeon Phi 7250 | GEMM, Conv | |
The GEMM and convolution benchmarks are run with 8 bit multiplication and 32 bit accumulation on Nvidia processors. However, Nvidia GPUs don't support all input sizes in this precision mode: input sizes have to be a multiple of 4. We have therefore padded input dimensions to be multiples of 4 for all kernels; the cost of padding and discarding the extra outputs is small compared to the cost of the operation. The results spreadsheet indicates which of the kernels required padding. Sparse and recurrent kernel results are reported in single precision, since the relevant libraries don't support low precision.
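A short sketch of that padding step, using one of the inference GEMM sizes below (M=35, N=700, K=2048, where only M needs padding). The int32 matmul here is just a stand-in for the actual int8 GEMM kernel.

```python
import numpy as np

def pad_to_multiple(x, multiple=4):
    """Zero-pad each dimension of x up to the next multiple of `multiple`."""
    pad = [(0, -dim % multiple) for dim in x.shape]
    return np.pad(x, pad)

# Illustrative int8 operands with M=35, K=2048, N=700; only M=35 needs padding (to 36).
A = np.random.randint(-127, 128, size=(35, 2048), dtype=np.int8)
B = np.random.randint(-127, 128, size=(2048, 700), dtype=np.int8)

Ap, Bp = pad_to_multiple(A), pad_to_multiple(B)
Cp = Ap.astype(np.int32) @ Bp.astype(np.int32)   # stand-in for the int8 GEMM kernel
C = Cp[:A.shape[0], :B.shape[1]]                 # discard the extra padded outputs
```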
Processor | Single Precision | Int8 multiply/32 bit accumulate |
---|---|---|
Nvidia 1080Ti | RNN, Sparse GEMM | GEMM, Conv |
Nvidia TitanX Pascal | RNN, Sparse GEMM | GEMM, Conv |
Nvidia TitanXp | RNN, Sparse GEMM | GEMM, Conv |
The table below describes the inference kernel results available for edge devices, broken down by processor, operation, and precision. We don't have any results for RNNs, since no ARM libraries support them. The ARM Compute Library is not yet supported on the iPhone.
Processor | Single Precision | Int8 inputs/32 bit math |
---|---|---|
Raspberry Pi 3 | Conv | GEMM, Sparse GEMM |
iPhone 6 | GEMM, Sparse GEMM | |
iPhone 7 | GEMM, Sparse GEMM | |
In this section, we document the performance of a few operations. These are picked at random and are only meant to demonstrate the performance for a few applications. The results below only include the time and FLOPS for the fastest processor for each operation and set of parameters. The full results can be found in the repository.
Results on more hardware platforms or different precision will be added once they are available. We welcome contributions from all hardware vendors and library developers.
Example results for GEMM, convolution, vanilla RNN, and LSTM training kernels were included in the first DeepBench release. The next section provides results for some example GRU kernels, and in section 5.1.2 we document the performance of pseudo half precision GEMM kernels.
The recurrent op kernels are only run on Nvidia hardware.
Hidden Units | Batch Size | TimeSteps | Application | Time (ms) | Fwd TeraFLOPS | Processor |
---|---|---|---|---|---|---|
2816 | 32 | 1500 | Speech Recognition | 1790.75 | 4.05 | TitanX Pascal |
2048 | 32 | 375 | Speech Recognition | 317.09 | 3.11 | 1080 Ti |
1024 | 64 | 1500 | Speaker ID | 357.04 | 3.06 | TitanXp |
In the results below, inputs and outputs are 16 bit but still use 32 bit compute.
For the GEMM kernels, M, N, and K represent the matrix sizes. The sizes of the two matrices are M×K and K×N.
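Assuming the conventional 2·M·N·K FLOP count for a GEMM, the TeraFLOPS column can be roughly reproduced from the kernel size and the reported (rounded) time; for example, for the first kernel below:

```python
# Rough reproduction of the TeraFLOPS column for the first kernel below,
# assuming the conventional 2*M*N*K FLOP count for a GEMM. The reported time
# is rounded, so this lands close to (not exactly on) the listed 3.42 TeraFLOPS.
M, N, K = 1760, 128, 1760
time_s = 0.23e-3
print(2 * M * N * K / time_s / 1e12)   # ~3.4 TeraFLOPS
```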
Kernel | Application | Results (ms) | TeraFLOPS | Processor |
---|---|---|---|---|
M=1760, N=128, K=1760 | Speech Recognition | 0.23 | 3.42 | P100 |
M=7860, N=64, K=2560 | Speech Recognition | 0.61 | 4.15 | P100 |
M=2560, N=64, K=2560 | Speech Recognition | 0.33 | 2.57 | P100 |
The next few sections provide a few results for GEMM, Convolution and Recurrent operations for inference kernels on server platforms. Results on Intel platforms should be available shortly.
Kernel | Application | Results (ms) | TeraFLOPS | Processor |
---|---|---|---|---|
M=5124, N=700, K=2048 | Speech Recognition | 0.46 | 31.94 | 1080 Ti |
M=35, N=700, K=2048 | Speech Recognition | 0.05 | 2.09 | 1080 Ti |
M=3072, N=3000, K=1024 | Speech Recognition | 0.49 | 38.36 | Titan Xp |
M=512, N=6000, K=2816 | Speech Recognition | 0.43 | 40.71 | Titan Xp |
Kernel | Sparsity | Application | Results (ms) | Speedup wrt dense | TeraFLOPS | Processor |
---|---|---|---|---|---|---|
M=7680, N=1, K=2560 | 0.95 | Speech Recognition | 0.03 | 6.56 | 1.10 | 1080 Ti |
M=7680, N=2, K=2560 | 0.95 | Speech Recognition | 0.04 | 5.93 | 1.74 | 1080 Ti |
M=7680, N=1500, K=2560 | 0.95 | Speech Recognition | 29.81 | 0.16 | 1.88 | TitanXp |
M=10752, N=1, K=3584 | 0.9 | Speech Recognition | 0.1 | 4 | 0.72 | TitanXp |
Input Size | Filter Size | # of Filters | Padding (h, w) | Stride (h, w) | Application | Time (ms) | TeraFLOPS | Processor |
---|---|---|---|---|---|---|---|---|
W = 341, H = 79, C = 32, N = 4 | R = 5, S = 10 | 32 | 0,0 | 2,2 | Speech Recognition | 0.29 | 8.45 | TitanXp |
W = 224, H = 224, C = 3, N = 1 | R = 7, S = 7 | 64 | 3, 3 | 2, 2 | Computer Vision | 0.14 | 1.64 | TitanXp |
W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1 | 128 | 0, 0 | 2, 2 | Computer Vision | 0.015 | 3.43 | TitanX Pascal |
W = 7, H = 7, C = 512, N = 2 | R = 1, S = 1 | 2048 | 0, 0 | 1, 1 | Computer Vision | 0.018 | 11.42 | 1080 Ti |
Hidden Units | Batch Size | TimeSteps | Recurrent Type | Application | Results (ms) | Fwd TeraFLOPS | Processor |
---|---|---|---|---|---|---|---|
1536 | 4 | 50 | LSTM | Language Modelling | 6.93 | 0.55 | TitanXp |
256 | 4 | 150 | LSTM | Character Language Modelling | 1.63 | 0.19 | 1080 Ti |
2816 | 1 | 1500 | GRU | Speech Recognition | 350.62 | 0.20 | TitanXp |
2560 | 2 | 375 | GRU | Speech Recognition | 75.02 | 0.39 | TitanXp |
Kernel | Application | Results (ms) | GigaFLOPS | Processor |
---|---|---|---|---|
M=5124, N=700, K=2048 | Speech Recognition | 212.84 | 69.03 | iPhone 7 |
M=35, N=700, K=2048 | Speech Recognition | 1.94 | 51.69 | iPhone 7 |
M=3072, N=1500, K=1024 | Speech Recognition | 136.63 | 69.07 | iPhone 7 |
Kernel | Sparsity | Application | Results (ms) | Speedup wrt dense | GigaFLOPS | Processor |
---|---|---|---|---|---|---|
M=7680, N=1, K=2560 | 0.95 | Speech Recognition | 1.01 | 15.55 | 18.55 | iPhone 7 |
M=7680, N=1500, K=2560 | 0.95 | Speech Recognition | 1677.36 | 5.46 | 16.70 | iPhone 7 |
M=7680, N=1, K=2560 | 0.9 | Speech Recognition | 2.1 | 8.02 | 8.41 | iPhone 7 |
Input Size | Filter Size | # of Filters | Padding (h, w) | Stride (h, w) | Application | Time (ms) | GigaFLOPS | Processor |
---|---|---|---|---|---|---|---|---|
W = 112, H = 112, C = 64, N = 1 | R = 1, S = 1 | 64 | 0, 0 | 1, 1 | Computer Vision | 670.75 | 0.15 | Raspberry Pi 3 |
W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1 | 128 | 0, 0 | 2, 2 | Computer Vision | 185.87 | 0.28 | Raspberry Pi 3 |
W = 7, H = 7, C = 512, N = 1 | R = 1, S = 1 | 2048 | 0, 0 | 1, 1 | Computer Vision | 735.28 | 0.14 | Raspberry Pi 3 |
We welcome contributions from the community to DeepBench. As described above, you can run the benchmarks on new hardware platforms and submit a pull request with your results, or contribute support for additional kernels and libraries.
We would like to thank Mohammad Shoeybi, Kexin Zhao, Markus Kliegl, Thuan Nguyen & Julia Li from Baidu Research for their contributions to DeepBench and our partners at Intel, Nvidia & ARM for their guidance.