[TVM Discuss] [Questions] CUDA -libs=cudnn performance

Posted to discuss-archive@tvm.apache.org by Gasgallo via TVM Discuss <no...@discuss.tvm.ai> on 2020/05/17 11:27:21 UTC


I'm comparing the performance of a model when using:
- `mxnet-cu100` with cuDNN
- TVM CUDA with `-libs=cudnn`

From my understanding the results should be basically the same, but TVM is a lot slower. When compiling the model I can see cuDNN logging its search for the best algorithm, so I think the setup is fine. After compiling the model I use `module.module.time_evaluator` to measure inference time (setup sketched below), and the result is:
- 1666 ms (vs. 338 ms with `mxnet-cu100`)
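
For context, my build-and-timing setup looks roughly like this (a trimmed-down sketch using the graph runtime API of this TVM version; the ResNet-50 from the Gluon model zoo and the 640x480 input shape are only stand-ins for the real model, which is converted the same way):

```python
import numpy as np
import mxnet as mx
import tvm
from tvm import relay
from tvm.contrib import graph_runtime  # graph_executor in newer TVM releases

# Stand-in model and input shape, not the actual network from this post.
input_shape = (1, 3, 640, 480)
block = mx.gluon.model_zoo.vision.resnet50_v1(pretrained=True)
mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape})

target = "cuda -libs=cudnn"   # ask TVM to offload supported ops to cuDNN
ctx = tvm.gpu(0)
graph, lib, params = relay.build(mod, target=target, params=params)

module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input("data", np.random.uniform(size=input_shape).astype("float32"))

# Average inference time over several runs.
ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=10)
print("mean inference time: %.2f ms" % (np.mean(ftimer().results) * 1000))
```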

I also tried the debug runtime (set up as sketched after the table), with the following results:
```
Node Name                                         Ops                                               Time(us)     Time(%)  Shape               Inputs  Outputs  
---------                                         ---                                               --------     -------  -----               ------  -------  
fused_nn_conv2d_transpose_multiply_add_nn_relu    fused_nn_conv2d_transpose_multiply_add_nn_relu    1144440.0    68.686   (1, 256, 640, 480)  4       1        
fused_nn_conv2d_transpose_multiply_add_nn_relu_1  fused_nn_conv2d_transpose_multiply_add_nn_relu_1  255461.0     15.332   (1, 256, 320, 240)  4       1        
fused_nn_conv2d_transpose_multiply_add_nn_relu_3  fused_nn_conv2d_transpose_multiply_add_nn_relu_3  126896.0     7.616    (1, 256, 80, 60)    4       1        
fused_nn_conv2d_transpose_multiply_add_nn_relu_2  fused_nn_conv2d_transpose_multiply_add_nn_relu_2  58891.6      3.535    (1, 256, 160, 120)  4       1        
fused_nn_conv2d_add_nn_relu                       fused_nn_conv2d_add_nn_relu                       27347.4      1.641    (1, 256, 640, 480)  3       1        
fused_nn_conv2d_add                               fused_nn_conv2d_add                               4067.5       0.244    (1, 64, 320, 240)   3       1        
fused_nn_conv2d_add_nn_relu_1                     fused_nn_conv2d_add_nn_relu_1                     3312.35      0.199    (1, 512, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_12                    fused_nn_conv2d_add_nn_relu_1                     3256.99      0.195    (1, 512, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_11                    fused_nn_conv2d_add_nn_relu_1                     3228.79      0.194    (1, 512, 40, 30)    3       1        
fused_nn_conv2d_add_4                             fused_nn_conv2d_add_4                             2783.87      0.167    (1, 2048, 40, 30)   3       1        
fused_nn_conv2d_add_add_nn_relu1                  fused_nn_conv2d_add_add_nn_relu                   1652.69      0.099    (1, 2048, 40, 30)   4       1        
fused_nn_conv2d_add_add_nn_relu2                  fused_nn_conv2d_add_add_nn_relu                   1651.54      0.099    (1, 2048, 40, 30)   4       1        
fused_nn_conv2d_add_add_nn_relu                   fused_nn_conv2d_add_add_nn_relu                   1650.41      0.099    (1, 2048, 40, 30)   4       1        
fused_nn_conv2d_add_nn_relu_2                     fused_nn_conv2d_add_nn_relu_2                     1437.18      0.086    (1, 512, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_21                    fused_nn_conv2d_add_nn_relu_2                     1421.56      0.085    (1, 512, 40, 30)    3       1        
fused_nn_conv2d_add_add_nn_relu_31                fused_nn_conv2d_add_add_nn_relu_3                 1101.8       0.066    (1, 256, 160, 120)  4       1        
fused_nn_conv2d_add_add_nn_relu_3                 fused_nn_conv2d_add_add_nn_relu_3                 1101.62      0.066    (1, 256, 160, 120)  4       1        
fused_nn_conv2d_add_add_nn_relu_32                fused_nn_conv2d_add_add_nn_relu_3                 1094.64      0.066    (1, 256, 160, 120)  4       1        
fused_nn_conv2d_add_2                             fused_nn_conv2d_add_2                             932.318      0.056    (1, 512, 80, 60)    3       1        
fused_nn_conv2d_add_1                             fused_nn_conv2d_add_1                             910.287      0.055    (1, 256, 160, 120)  3       1        
fused_nn_conv2d_add_nn_relu_111                   fused_nn_conv2d_add_nn_relu_11                    902.002      0.054    (1, 128, 160, 120)  3       1        
fused_nn_conv2d_add_nn_relu_6                     fused_nn_conv2d_add_nn_relu_6                     881.889      0.053    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_44                    fused_nn_conv2d_add_nn_relu_4                     877.233      0.053    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_41                    fused_nn_conv2d_add_nn_relu_4                     877.187      0.053    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_42                    fused_nn_conv2d_add_nn_relu_4                     875.494      0.053    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_43                    fused_nn_conv2d_add_nn_relu_4                     875.401      0.053    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_4                     fused_nn_conv2d_add_nn_relu_4                     874.796      0.053    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_10                    fused_nn_conv2d_add_nn_relu_10                    846.727      0.051    (1, 128, 80, 60)    3       1        
fused_nn_conv2d_add_3                             fused_nn_conv2d_add_3                             826.964      0.05     (1, 1024, 40, 30)   3       1        
fused_nn_conv2d_add_nn_relu_7                     fused_nn_conv2d_add_nn_relu_7                     777.48       0.047    (1, 256, 80, 60)    3       1        
fused_nn_conv2d_add_nn_relu_3                     fused_nn_conv2d_add_nn_relu_3                     747.369      0.045    (1, 512, 40, 30)    3       1        
fused_nn_conv2d_add_add_nn_relu_2                 fused_nn_conv2d_add_add_nn_relu_2                 702.484      0.042    (1, 512, 80, 60)    4       1        
fused_nn_conv2d_add_add_nn_relu_22                fused_nn_conv2d_add_add_nn_relu_2                 701.242      0.042    (1, 512, 80, 60)    4       1        
fused_nn_conv2d_add_add_nn_relu_21                fused_nn_conv2d_add_add_nn_relu_2                 700.993      0.042    (1, 512, 80, 60)    4       1        
fused_nn_conv2d_add_add_nn_relu_23                fused_nn_conv2d_add_add_nn_relu_2                 697.114      0.042    (1, 512, 80, 60)    4       1        
fused_nn_conv2d_add_nn_relu_123                   fused_nn_conv2d_add_nn_relu_12                    610.56       0.037    (1, 64, 160, 120)   3       1        
fused_nn_conv2d_add_nn_relu_122                   fused_nn_conv2d_add_nn_relu_12                    596.89       0.036    (1, 64, 160, 120)   3       1        
fused_nn_conv2d_add_nn_relu_121                   fused_nn_conv2d_add_nn_relu_12                    594.063      0.036    (1, 64, 160, 120)   3       1        
fused_nn_conv2d_add_add_nn_relu_1                 fused_nn_conv2d_add_add_nn_relu_1                 529.202      0.032    (1, 1024, 40, 30)   4       1        
fused_nn_conv2d_add_add_nn_relu_12                fused_nn_conv2d_add_add_nn_relu_1                 529.186      0.032    (1, 1024, 40, 30)   4       1        
fused_nn_conv2d_add_add_nn_relu_14                fused_nn_conv2d_add_add_nn_relu_1                 528.849      0.032    (1, 1024, 40, 30)   4       1        
fused_nn_conv2d_add_add_nn_relu_13                fused_nn_conv2d_add_add_nn_relu_1                 528.555      0.032    (1, 1024, 40, 30)   4       1        
fused_nn_conv2d_add_add_nn_relu_15                fused_nn_conv2d_add_add_nn_relu_1                 528.112      0.032    (1, 1024, 40, 30)   4       1        
fused_nn_conv2d_add_add_nn_relu_11                fused_nn_conv2d_add_add_nn_relu_1                 527.938      0.032    (1, 1024, 40, 30)   4       1        
fused_nn_conv2d_add_nn_relu_131                   fused_nn_conv2d_add_nn_relu_13                    514.069      0.031    (1, 64, 160, 120)   3       1        
fused_nn_conv2d_add_nn_relu_13                    fused_nn_conv2d_add_nn_relu_13                    514.053      0.031    (1, 64, 160, 120)   3       1        
fused_nn_conv2d_add_nn_relu_8                     fused_nn_conv2d_add_nn_relu_8                     508.776      0.031    (1, 128, 80, 60)    3       1        
fused_nn_conv2d_add_nn_relu_82                    fused_nn_conv2d_add_nn_relu_8                     506.434      0.03     (1, 128, 80, 60)    3       1        
fused_nn_conv2d_add_nn_relu_81                    fused_nn_conv2d_add_nn_relu_8                     503.658      0.03     (1, 128, 80, 60)    3       1        
fused_nn_conv2d_add_nn_relu_92                    fused_nn_conv2d_add_nn_relu_9                     421.353      0.025    (1, 128, 80, 60)    3       1        
fused_nn_conv2d_add_nn_relu_9                     fused_nn_conv2d_add_nn_relu_9                     418.987      0.025    (1, 128, 80, 60)    3       1        
fused_nn_conv2d_add_nn_relu_91                    fused_nn_conv2d_add_nn_relu_9                     417.537      0.025    (1, 128, 80, 60)    3       1        
fused_nn_conv2d_add_nn_relu_54                    fused_nn_conv2d_add_nn_relu_5                     416.805      0.025    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_51                    fused_nn_conv2d_add_nn_relu_5                     416.662      0.025    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_53                    fused_nn_conv2d_add_nn_relu_5                     415.586      0.025    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_5                     fused_nn_conv2d_add_nn_relu_5                     415.565      0.025    (1, 256, 40, 30)    3       1        
fused_nn_conv2d_add_nn_relu_52                    fused_nn_conv2d_add_nn_relu_5                     415.012      0.025    (1, 256, 40, 30)    3       1        
fused_nn_max_pool2d_nn_relu                       fused_nn_max_pool2d_nn_relu                       262.382      0.016    (1, 64, 160, 120)   1       1        
fused_nn_conv2d_add_nn_relu_14                    fused_nn_conv2d_add_nn_relu_14                    260.942      0.016    (1, 64, 160, 120)   3       1        
Total_time                                        -                                                 1666185.096  -        -                   -       -        
23.942410945892334
```
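
For completeness, the table above comes from the debug graph runtime, roughly as follows (reusing `graph`, `lib`, `params`, `ctx` and `input_shape` from the sketch above; the dump path is arbitrary):

```python
from tvm.contrib.debugger import debug_runtime  # debug_executor in newer TVM

# Runs every node individually and prints the Node Name / Time(us) / Time(%)
# breakdown shown above, plus per-node dumps under dump_root.
m = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
m.set_input(**params)
m.set_input("data", np.random.uniform(size=input_shape).astype("float32"))
m.run()
```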

If I remove the `conv2d_transpose` layers from the model, performance is as follows:
- TVM CUDA autotuned: 40 ms
- `mxnet-cu100`: 60 ms

It's clear that the `conv2d_transpose` implementation is the bottleneck for TVM, but why can't I reproduce the same performance when specifying `-libs=cudnn`?





---

[TVM Discuss] [Questions] CUDA -libs=cudnn performance

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

conv_transpose won't be run on cudnn even if you specify `-libs=cudnn`. Does this answer your question?

---

[TVM Discuss] [Questions] CUDA -libs=cudnn performance

Posted by Gasgallo via TVM Discuss <no...@discuss.tvm.ai>.

Yes, thank you! Which ops do we offload to CUDNN then?

---

[TVM Discuss] [Questions] CUDA -libs=cudnn performance

Posted by masahi via TVM Discuss <no...@discuss.tvm.ai>.

conv2d, conv3d and softmax.
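
Only those ops have cuDNN-backed implementations; everything else in the graph, including `conv2d_transpose`, still goes through TVM's own CUDA schedules. As a rough illustration (a hypothetical helper, not TVM's actual strategy code):

```python
import tvm

target = tvm.target.create("cuda -libs=cudnn")

def uses_cudnn(op_name, target):
    # Hypothetical helper mirroring the check done when picking an implementation:
    # the library must be listed in the target and the op must have a cuDNN kernel.
    cudnn_ops = {"nn.conv2d", "nn.conv3d", "nn.softmax"}
    return "cudnn" in target.libs and op_name in cudnn_ops

print(uses_cudnn("nn.conv2d", target))            # True  -> cuDNN
print(uses_cudnn("nn.conv2d_transpose", target))  # False -> TVM's own CUDA schedule
```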




