Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/06/26 00:41:47 UTC

[GitHub] lihaofd opened a new pull request #11399: [WIP] Add Fused Vanilla RNN and dropout

URL: https://github.com/apache/incubator-mxnet/pull/11399
 
 
   ## Description ##
   This PR adds a fused vanilla RNN (tanh/relu) operator and dropout support for GRU/LSTM/vRNN on CPU.
   @pengzhao-intel, @TaoLv 
   
   ## Feature changes ##
   ### New features ###
   - Single-layer/multi-layer and unidirectional/bidirectional vanilla RNN (tanh/relu), including both forward and backward computation.
   - Dropout support for GRU/LSTM/vRNN.
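The fused kernel computes, per time step, h_t = act(x_t·Wx + h_prev·Wh + b) with act = tanh or ReLU, and applies dropout between stacked layers. A rough NumPy sketch of that computation (function and parameter names are illustrative, not the operator's actual implementation):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b, activation="tanh"):
    # One vanilla RNN cell step: h_t = act(x_t @ Wx + h_prev @ Wh + b)
    pre = x_t @ Wx + h_prev @ Wh + b
    return np.tanh(pre) if activation == "tanh" else np.maximum(pre, 0.0)

def rnn_layer(x, h0, Wx, Wh, b, activation="tanh"):
    # Forward over a full (seq_len, batch, input_size) sequence;
    # returns the stacked hidden states and the final state.
    h, outs = h0, []
    for t in range(x.shape[0]):
        h = rnn_step(x[t], h, Wx, Wh, b, activation)
        outs.append(h)
    return np.stack(outs), h

def stacked_rnn(x, params, dropout=0.0, activation="tanh", rng=None):
    # Multi-layer unidirectional stack; inverted dropout is applied
    # to the output of every layer except the last.
    rng = rng or np.random.default_rng(0)
    out = x
    for i, (Wx, Wh, b) in enumerate(params):
        h0 = np.zeros((x.shape[1], Wh.shape[0]))
        out, _ = rnn_layer(out, h0, Wx, Wh, b, activation)
        if dropout > 0.0 and i < len(params) - 1:
            mask = rng.random(out.shape) >= dropout
            out = out * mask / (1.0 - dropout)
    return out
```

This is a sketch of the semantics only; the fused operator achieves its speedup by batching the input-to-hidden GEMMs across time steps rather than looping cell by cell as above.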
   
   ### Unit-test changes ###
   - Add new test cases in tests/python/unittests/test_operator.py.
   - Update the test case in example/rnn/bucketing/cudnn_rnn_bucketing.py.
   - Check consistency with the original RNNCell implementation.
   
   ### Performance ###
   We tested the performance of FusedRNN against the non-fused RNNCell on a local Skylake-8180 (2 sockets, 56 cores), using MKL as the BLAS library.
   The test input uses the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800).
   
   Layer = 1, bidirectional = False
   
   | API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
   | -------- | :-----: | :----: |
   | rnn.RNNCell - non-fused (Tanh, CPU) | 492.61 | 198.02 |
   | this PR - FusedRNN (Tanh, CPU) | 952.38 | 318.98 |
   | speedup | 1.93x | 1.61x |
   
   
   | API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
   | -------- | :-----: | :----: |
   | rnn.RNNCell - non-fused (Relu, CPU) | 277.78 | 104.17 |
   | this PR - FusedRNN (Relu, CPU) | 740.74 | 177.00 |
   | speedup | 2.67x | 1.70x |
   
   Layer = 5, bidirectional = True
   
   | API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
   | -------- | :-----: | :----: |
   | rnn.RNNCell - non-fused (Tanh, CPU) | 38.91 | 22.73 |
   | rnn.RNNCell (Tanh, cuda) | 47.85 | 26.95 |
   | rnn.RNNCell (Tanh, cudnn) | 208.33 | 81.63 |
   | this PR - FusedRNN (Tanh, CPU) | 104.17 | 34.01 |
   | speedup - this PR vs. RNNCell (Tanh, CPU) | 267.7% | 149.7% |
   | speedup - this PR vs. RNNCell (Tanh, cuda) | 217.7% | 126.2% |
   | speedup - this PR vs. RNNCell (Tanh, cudnn) | 50% | 41.7% |
   
   
   | API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
   | -------- | :-----: | :----: |
   | rnn.RNNCell - non-fused (Relu, CPU) | 40.73 | 22.6 |
   | rnn.RNNCell (Relu, cuda) | 52.91 | 26.81 |
   | rnn.RNNCell (Relu, cudnn) | 206.83 | 82.64 |
   | this PR - FusedRNN (Relu, CPU) | 134.23 | 35.97 |
   | speedup - this PR vs. RNNCell (Relu, CPU) | 329.5% | 159.2% |
   | speedup - this PR vs. RNNCell (Relu, cuda) | 253.7% | 134.2% |
   | speedup - this PR vs. RNNCell (Relu, cudnn) | 64.9% | 43.5% |
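Since the tables report throughput in samples/sec, the speedup figures are simple throughput ratios; the first tables quote them as multipliers and the Layer = 5 tables as percentages. A quick check against a few entries from the tables above:

```python
def speedup(fused, baseline):
    # Throughputs are in samples/sec, so speedup is fused / baseline.
    return fused / baseline

# Layer = 1, Tanh: inference and training speedups (first table)
assert round(speedup(952.38, 492.61), 2) == 1.93
assert round(speedup(318.98, 198.02), 2) == 1.61

# Layer = 5 bidirectional, Tanh, vs. RNNCell on CPU (percent form)
assert round(100 * speedup(104.17, 38.91), 1) == 267.7

# Vs. cudnn, the fused CPU kernel reaches about half the GPU throughput
assert round(100 * speedup(104.17, 208.33)) == 50
```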
   
   ### Convergence Curve ###
   We tested the convergence of fused GRU/LSTM (dropout = 0.5) on CPU (Skylake-8180, 2 sockets, 56 cores) and GPU (P100) using example/rnn/bucketing/cudnn_rnn_bucketing.py.
   Test settings: layer = 3, batch_size = 32, num-embed = 800, num-hidden = 800, num-epochs = 20.
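The convergence runs could be launched along these lines; the exact flag names follow the bucketing example's usual argparse options and are an assumption, not copied from the PR:

```shell
# Hypothetical invocation of the bucketing example; flag names are assumed.
# CPU run exercising the fused kernels from this PR:
python example/rnn/bucketing/cudnn_rnn_bucketing.py \
    --num-layers 3 --batch-size 32 --num-embed 800 \
    --num-hidden 800 --num-epochs 20

# GPU (P100) baseline for comparison:
python example/rnn/bucketing/cudnn_rnn_bucketing.py \
    --num-layers 3 --batch-size 32 --num-embed 800 \
    --num-hidden 800 --num-epochs 20 --gpus 0
```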
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services