DeepBench updates with a focus on deep learning inference
Date: June 28, 2017
Authors: Sharan Narang, Greg Diamos
In September 2016, we released DeepBench, an open source benchmarking tool that measures the performance of basic operations involved in training deep learning networks. The benchmark included results on several different processors used for training. We’ve now extended DeepBench to support deep learning inference. The inference additions include low-precision kernels, modified kernel sizes, and benchmarks for sparse operations. In addition, the training benchmark now supports Gated Recurrent Units (GRUs) and low-precision training.
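As an illustration of the kind of comparison the sparse inference benchmarks make, here is a minimal Python sketch (not DeepBench code; the GEMM shape and the 10% density are assumptions for the example) that times a dense product against a sparse-weight product at an inference-style batch size of one:

```python
# Hypothetical sketch, not part of DeepBench: compare a dense GEMM against a
# sparse-weight product at an inference-style shape (batch size 1).
import time

import numpy as np
import scipy.sparse as sp

m, k, n = 5124, 2048, 1               # illustrative shape; n = 1 mimics inference
density = 0.1                         # assumed weight sparsity, not a DeepBench setting

dense_w = np.random.rand(m, k).astype(np.float32)
sparse_w = sp.random(m, k, density=density, format="csr", dtype=np.float32)
x = np.random.rand(k, n).astype(np.float32)

def avg_time(fn, repeats=50):
    fn()                              # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

dense_t = avg_time(lambda: dense_w @ x)
sparse_t = avg_time(lambda: sparse_w @ x)
print(f"dense: {dense_t * 1e3:.3f} ms, sparse ({density:.0%}): {sparse_t * 1e3:.3f} ms")
```

Whether the sparse product actually wins depends on the kernel and the hardware, which is exactly what a benchmark like this is meant to expose.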
DeepBench: Open-Source Tool for Benchmarking DL Operations
Date: September 26, 2016
Authors: Sharan Narang
DeepBench is the first open source benchmarking tool for evaluating the performance of deep learning operations on different hardware platforms. DeepBench also includes a list of operations and workloads that are important to the performance of deep learning training. We’ve measured the performance of matrix multiplies, convolutions, recurrent ops (vanilla RNNs and LSTMs), and all-reduce for different sizes and parameters on Nvidia and Intel processors. We welcome contributions from the deep learning community to add to the list of existing operations, and from hardware vendors who would like to provide benchmark results.
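For a concrete sense of what one benchmark entry measures, here is a minimal Python sketch (not DeepBench itself; the shape is an illustrative assumption) that times a single-precision matrix multiply and reports the achieved GFLOPS using the standard 2·M·N·K flop count:

```python
# Minimal sketch of a GEMM benchmark entry: time C = A @ B for a fixed shape
# and report achieved GFLOPS. Not DeepBench code; the shape is illustrative.
import time

import numpy as np

M, N, K = 1760, 128, 1760             # assumed training-style GEMM shape

A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

A @ B                                 # warm-up
repeats = 20
start = time.perf_counter()
for _ in range(repeats):
    A @ B
elapsed = (time.perf_counter() - start) / repeats

gflops = 2.0 * M * N * K / elapsed / 1e9   # 2*M*N*K flops per matrix multiply
print(f"{M}x{K} times {K}x{N}: {elapsed * 1e3:.2f} ms, {gflops:.1f} GFLOPS")
```

The benchmark itself runs hardware-specific libraries over many such shapes and reports results per kernel, which is what makes the numbers comparable across processors.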
Optimizing RNNs with Differentiable Graphs
Part II: Optimizing RNN performance
Date: June 14th, 2016
Authors: Jesse Engel
Differentiable graph notation provides an easy way to visually infer the gradients of complex neural networks. We also show several useful rules of thumb for optimizing the graphs of new algorithms.
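As a toy example of what reading a gradient off a graph means (this is not code from the post), consider a graph with a single matrix multiply followed by a squared loss; the weight gradient follows from the chain rule and can be checked numerically:

```python
# Toy illustration: the graph y = W @ x, loss = sum(y**2) gives
# dloss/dW = 2 * y @ x.T by the chain rule; verify one entry numerically.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal((4, 1))

y = W @ x
grad_W = 2.0 * y @ x.T                # gradient read off the graph

eps = 1e-6
W_pert = W.copy()
W_pert[1, 2] += eps                   # perturb a single weight
numeric = (np.sum((W_pert @ x) ** 2) - np.sum(y ** 2)) / eps
print(grad_W[1, 2], numeric)          # should agree to several digits
```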
Persistent RNNs: 30 times faster RNN layers at small mini-batch sizes
Date: March 25th, 2016
Authors: Greg Diamos
YouTube: SVAIL Tech Notes: Accelerating RNNs by Stashing Weights On-Chip
At SVAIL, our mission is to create AI technology that lets us have a significant impact on hundreds of millions of people. We believe that a good way to do this is to improve the accuracy of speech recognition by scaling up deep learning algorithms on larger datasets than has been done in the past.
Around the World in 60 Days: Getting Deep Speech to Work on Mandarin
Date: February 9th, 2016
Authors: Tony Han, Ryan Prenger
YouTube: SVAIL Tech Notes: Recognizing both English and Mandarin
In our recent paper, Deep Speech 2, we showed our results in Mandarin. In just a few months, we had produced a Mandarin speech recognition system with a recognition rate better than that of native Mandarin speakers. Here we want to discuss what we did to adapt the system to Mandarin and how the end-to-end learning approach made the whole project easier.
Fast Open Source CPU/GPU Implementation of CTC
Date: January 14th, 2016
Contact: svail-questions@baidu.com
YouTube: SVAIL Tech Notes: Warp CTC
Warp-CTC from Baidu Research's Silicon Valley AI Lab is a fast parallel implementation of CTC on both CPU and GPU. Warp-CTC can be used to solve supervised problems that map an input sequence to an output sequence, such as speech recognition. To get Warp-CTC, follow the link above. If you are interested in integrating Warp-CTC into a machine learning framework, reach out to us. We are happy to accept pull requests.
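To make the "input sequence to output sequence" mapping concrete, here is a toy Python illustration of the CTC labeling rule that the loss is built on (this is not Warp-CTC's API): a per-frame label sequence is collapsed by merging repeated labels and then dropping blanks.

```python
# Toy illustration of the CTC collapse rule (not the Warp-CTC API):
# merge repeated labels, then remove blanks.
BLANK = "_"

def ctc_collapse(frame_labels):
    """Map a per-frame label sequence (e.g. one label per audio frame) to a string."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

print(ctc_collapse(list("cc_aa_t")))   # -> "cat"
print(ctc_collapse(list("bo_ok")))     # -> "book": the blank keeps the repeated "o"
print(ctc_collapse(list("book")))      # -> "bok": without a blank, repeats merge
```

The CTC loss sums the probability of every frame labeling that collapses to the target transcript; that forward-backward computation is the expensive part Warp-CTC parallelizes.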
Investigating performance of GPU BLAS Libraries
Part I: Optimizing RNN performance
Date: November 17th, 2015
Author: Erich Elsen
Most researchers engaging in neural network research have been using GPUs for training for some time now due to the speed advantage they have over CPUs. GPUs from NVIDIA are almost universally preferred because they come with high-quality BLAS (cuBLAS) and convolution (cuDNN) libraries.
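One reason BLAS performance translates so directly into RNN performance is that a vanilla RNN time step is essentially two matrix multiplies plus a cheap elementwise nonlinearity. A minimal numpy sketch with assumed, illustrative dimensions:

```python
# Minimal sketch (illustrative sizes, not from the post): one vanilla RNN step
# is dominated by two GEMMs, so GEMM (BLAS) performance largely sets RNN performance.
import numpy as np

hidden, input_dim, batch = 1152, 1152, 4      # assumed layer sizes, small mini-batch

rng = np.random.default_rng(0)
W_x = rng.standard_normal((hidden, input_dim)).astype(np.float32)   # input weights
W_h = rng.standard_normal((hidden, hidden)).astype(np.float32)      # recurrent weights
b = np.zeros((hidden, 1), dtype=np.float32)

x_t = rng.standard_normal((input_dim, batch)).astype(np.float32)    # input at one time step
h_prev = np.zeros((hidden, batch), dtype=np.float32)                # previous hidden state

# Two GEMMs (W_x @ x_t and W_h @ h_prev) followed by an elementwise tanh.
h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)
print(h_t.shape)                              # (hidden, batch)
```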