Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/04/29 16:29:47 UTC

[GitHub] [incubator-mxnet] access2rohit commented on pull request #18168: Separate GPU kernel for broadcast_axis

access2rohit commented on pull request #18168:
URL: https://github.com/apache/incubator-mxnet/pull/18168#issuecomment-621322233


   > > > Based on your PR description, the only difference between the CPU and GPU kernels is that the GPU kernel uses int32_t for indexing and the CPU kernel uses int64_t (when the large tensor compile flag is on). Duplicating the kernel implementation into two structs seems like overkill to me. Could we optimize the code so that they still share the same kernel template but with a different index_t type? Also, it would be nice to re-architect the code so that it applies to all operators that share the same template between CPU and GPU kernels.
   > > 
   > > 
   > > @apeforest
   > > 
   > > * We cannot use a typedef based on xpu because a typedef is a compile-time option.
   > > * The blast radius is large if we make a generic CPU/GPU kernel. Numpy ops that call the broadcast_axis kernel for their internal operations would need to be updated as well, so I created broadcast_axis_gpu for ops relying only on BroadcastComputeImpl internally, which is evident from the 43 failing GPU tests when a temp workspace was used.
   > > 
   > > Hence, the 2 structs are not overkill but necessary to reduce impact; otherwise the change would require performing stride calculations inside the implementations of approximately 15 operators, thereby significantly increasing the changes required.
   > > Again, a typedef is resolved at compile time and cannot be decided dynamically.
   > > I have updated the description now. It wasn't consistent while the PR was WIP, and now it is ready for review!
   > 
   > I did not mean `typedef` is the only way to do so. Can we leverage the template feature? In fact, the Map function in the GPU kernel always uses `int` instead of `index_t`.
   
   @apeforest : The kernels can be fused together using templates, but this approach hinders CPU performance. There is another PR that actually boosts performance for CPU, since there I am leveraging vectorization: https://github.com/apache/incubator-mxnet/pull/17882 .
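
   For readers following the discussion, a minimal, self-contained sketch of the "single kernel template, parameterized on the index type" idea being suggested might look like the following. This is not the actual MXNet code; `BroadcastAxisKernel` and the simplified broadcast formula are hypothetical and stand in for the real stride-based kernel.

   ```cpp
   // Sketch only: one kernel struct shared by CPU and GPU paths, with the
   // index type supplied as a template parameter instead of duplicating
   // the struct per backend.
   #include <cstdint>
   #include <iostream>
   #include <vector>

   template <typename IndexT>
   struct BroadcastAxisKernel {
     // Computes one output element for a simple broadcast of a length-`in_size`
     // input along a new leading axis: out[i] = in[i % in_size].
     static void Map(IndexT i, const float* in, float* out, IndexT in_size) {
       out[i] = in[i % in_size];
     }
   };

   int main() {
     std::vector<float> in = {1.f, 2.f, 3.f};
     std::vector<float> out(6, 0.f);

     // "CPU-style" launch: 64-bit indexing (large tensor support).
     for (std::int64_t i = 0; i < static_cast<std::int64_t>(out.size()); ++i)
       BroadcastAxisKernel<std::int64_t>::Map(i, in.data(), out.data(), 3);

     // A "GPU-style" launch would instantiate the same struct with 32-bit
     // indices, e.g. BroadcastAxisKernel<std::int32_t>::Map(...).

     for (float v : out) std::cout << v << ' ';  // prints: 1 2 3 1 2 3
     std::cout << '\n';
   }
   ```

   Under this scheme both backends share the same Map body, which is what the review is asking for; the counterpoint raised in this thread is that a shared body would conflict with the CPU-specific vectorized implementation in PR #17882.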


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org