Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/01/08 21:28:11 UTC

[GitHub] [tvm] masahi commented on pull request #7233: [TOPI] Minor perf improvement for GPU scatter

masahi commented on pull request #7233:
URL: https://github.com/apache/tvm/pull/7233#issuecomment-757006403


   The second text block is an excerpt from the output of `nvprof --print-gpu-trace`, showing the elapsed time, launch configuration, etc. of each kernel executed, in order.
   
   I don't have any benchmark other than the data from MaskRCNN. As for the first kernel of 4D scatter: since it is just a memcpy, I don't see why we should thread it differently from other injective ops. I hope we don't need thorough benchmarking to justify this change.
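To make the "thread it like any other injective op" point concrete, here is a rough NumPy sketch (function and parameter names are mine, not from the PR): an injective-style mapping just flattens the tensor and assigns one simulated thread per output element, with no dependence between elements.

```python
import numpy as np

def injective_style_copy(data, threads_per_block=256):
    """Copy a tensor the way an injective GPU op would: flatten the
    output and give each (simulated) thread exactly one element.

    The inner slice assignment stands in for a CUDA thread block;
    there is no cross-element dependence, so any launch config works.
    """
    flat = data.reshape(-1)
    out = np.empty_like(flat)
    n = flat.size
    for start in range(0, n, threads_per_block):
        end = min(start + threads_per_block, n)
        out[start:end] = flat[start:end]  # one element per simulated thread
    return out.reshape(data.shape)
```

Because every element is independent, the copy needs no special-cased launch dimensions, which is the argument for treating it like the other injective ops.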
   
   > Would it be a better idea to have two separate scatter implementations (the parallel one and the sequential one) and let autotvm figure out which is better? Then we don't have to have all this special casing and magic input sizes.
   
   Hmm, this sounds better than picking an arbitrary threshold, but do we have existing uses of autotvm to make such a decision? Given that the scatter kernels are extern, I'm not sure autotvm can work with them.
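For readers following along, here is a hypothetical NumPy sketch (my naming, not the PR's code) of the two scatter strategies being weighed against each other. The sequential version walks the updates in order with a single "thread"; the parallel version assigns one "thread" per update, which is faster on GPU but, with duplicate indices, races unless atomics or an ordering guarantee are added.

```python
import numpy as np

def scatter_sequential(data, indices, updates):
    """Single 'thread' applies updates in order; duplicate indices
    deterministically keep the last update."""
    out = data.copy()
    for idx, val in zip(indices, updates):
        out[idx] = val
    return out

def scatter_parallel(data, indices, updates):
    """One 'thread' per update. On a real GPU, duplicate indices race;
    NumPy's fancy indexing happens to keep the last write, but a CUDA
    kernel gives no such guarantee without atomics."""
    out = data.copy()
    out[indices] = updates
    return out
```

The two produce the same result here only because NumPy resolves duplicates deterministically; the divergence on real hardware is exactly why one would want a tuner (or a threshold) to pick between the implementations rather than hard-coding one.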


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org