Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/08/07 22:36:47 UTC

[GitHub] [incubator-mxnet] ptrendx commented on pull request #18622: Use RTC for elementwise and broadcast ops

ptrendx commented on pull request #18622:
URL: https://github.com/apache/incubator-mxnet/pull/18622#issuecomment-670771199


   @eric-haibin-lin Yes. The overhead comes from preparing a string with the kernel options (such as the datatypes) and then looking the kernel function up in a cache. CUDA Graphs capture the resulting function, so after the first launch the lookup no longer occurs.
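
   The lookup described above can be sketched roughly as follows. This is a minimal illustration, not MXNet's actual RTC code: the names (`GetCachedKernel`, `CompileKernel`, `KernelHandle`) are hypothetical stand-ins, and real RTC would call NVRTC and `cuModuleGetFunction` where the sketch just bumps a counter.

   ```cpp
   #include <cassert>
   #include <string>
   #include <unordered_map>

   using KernelHandle = int;  // stand-in for a CUfunction handle

   static int compile_count = 0;

   // Placeholder for the expensive path: in real RTC this would invoke
   // NVRTC compilation and module/function extraction.
   KernelHandle CompileKernel(const std::string& /*key*/) {
     return ++compile_count;
   }

   KernelHandle GetCachedKernel(const std::string& op,
                                const std::string& dtype_in,
                                const std::string& dtype_out) {
     static std::unordered_map<std::string, KernelHandle> cache;
     // Step 1: build the option string -- this string construction is
     // part of the per-launch overhead discussed above.
     const std::string key = op + "|" + dtype_in + "|" + dtype_out;
     // Step 2: cache lookup -- the other part of the overhead.
     // Compilation itself only happens on a miss.
     auto it = cache.find(key);
     if (it != cache.end()) return it->second;
     KernelHandle h = CompileKernel(key);
     cache.emplace(key, h);
     return h;
   }
   ```

   Calling `GetCachedKernel` twice with the same op and datatypes compiles only once; every later launch still pays for the key construction and hash lookup, which is the overhead a captured CUDA graph avoids.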
   
   That said, this overhead is lower than the overhead of `cudaLaunchKernel` itself and is barely noticeable. I tried it with a worst-case scenario: a fully hybridized model that added single-element tensors (so it was 100% CPU limited), and got a ~10% slowdown. A more realistic workload, with kernels taking longer than a few microseconds, would show no difference. The same CPU-limited test with a non-hybridized model showed no noticeable slowdown (the overheads of imperative mode are far higher than this).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org