Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2022/02/01 13:21:49 UTC

[GitHub] [tvm] masahi commented on pull request #10110: [CUTLASS] Conv2d dgrad

masahi commented on pull request #10110:
URL: https://github.com/apache/tvm/pull/10110#issuecomment-1026837332


   HUGE UPDATE: Thanks to a tip from @manishucsd and @hwu36, it turns out that upgrading the CUDA version from 11.3 to 11.6 alone gives a 2x speedup on cutlass strided dgrad (unreal). Moreover, there was a critical bug in the initialization of the `beta` parameter, which was causing unnecessary memory traffic and hurt the batch 256 case a lot. The result was still correct because the `C` tensor, which points to the output buffer, was initialized with zeros.
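
To illustrate why the `beta` bug went unnoticed, here is a minimal sketch of a linear-combination epilogue, `D = alpha * acc + beta * C`. It is not the actual CUTLASS or TVM code: the `epilogue` helper, the read counter, and the value `beta = 1.0` for the buggy case are all assumptions made for illustration, as is the premise that `C` is only loaded when `beta` is nonzero.

```python
def epilogue(acc, c, alpha=1.0, beta=0.0):
    """Hypothetical epilogue: D = alpha * acc + beta * C.

    Returns (D, number of elements of C that were read).
    """
    if beta == 0.0:
        # C is never touched, so there is no extra memory traffic.
        return [alpha * a for a in acc], 0
    # beta != 0: every element of C has to be loaded.
    return [alpha * a + beta * ci for a, ci in zip(acc, c)], len(c)


acc = [1.5, -2.0, 4.0]  # stand-in for the dgrad accumulator values
c = [0.0, 0.0, 0.0]     # zero-initialized output buffer, as in the comment

d_fixed, reads_fixed = epilogue(acc, c, beta=0.0)  # intended setting
d_buggy, reads_buggy = epilogue(acc, c, beta=1.0)  # assumed buggy setting
```

In this sketch both calls produce the same `D` because `C` is all zeros, but the nonzero-`beta` call reads every element of `C`. That extra traffic grows with the size of the output, which would be consistent with the batch 256 case being hit hardest.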
   
   Here are the updated results after these two fixes:
   
   * [Batch size 8](https://gist.github.com/masahi/90f68803ade90a5900029b128eb59dcf) 
   * [Batch size 256](https://gist.github.com/masahi/a373b06e76e9d228d45ffdddc1cd9f76)
   
   Now cutlass wins in ALL but one case at batch size 256, and even there the gap is small (0.96 vs 0.94). Note that activation fusion is not enabled for dgrad yet, so I expect the cutlass perf to be much better in practice for DL training use cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org