Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/04/12 00:57:04 UTC

[GitHub] [incubator-tvm] roastduck commented on issue #5307: [TIR] Make lower_warp_memory support extent(threadIdx.x) < warp_size

roastduck commented on issue #5307: [TIR] Make lower_warp_memory support extent(threadIdx.x) < warp_size
URL: https://github.com/apache/incubator-tvm/pull/5307#issuecomment-612545516
 
 
   > Thanks @roastduck. I wonder if we can also discuss the alternative abstractions. Right now the abstraction seems to suggest that conceptually the size of the warp is reduced to half (as the shuffle size). However, another way to view it would be to keep the size of the warp fixed (32), but support the index access pattern of the subgroups. For example, the canonical form below describes a shuffle in groups of 4:
   > 
   > ```c++
   > A[wi] = B[(wi/4)*4+ ((wi % 4) +1) %4]
   > ```
   
   In the alternative approach, `__shfl(x, (threadIdx.x + 1) % 4, 4)` becomes `__shfl(x, threadIdx.y * 4 + (threadIdx.x + 1) % 4)` (and `threadIdx.z` might also be involved). Is my understanding right?
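
   To make the two indexing conventions concrete, here is a small host-side C++ sketch. The helper names `src_lane_small_warp` and `src_lane_full_warp` are mine (not TVM or CUDA APIs); they only model the source-lane arithmetic of the two views for a shuffle in groups of 4:

   ```c++
   #include <cassert>
   #include <cstdio>

   // This PR's view: the warp is logically shrunk to width 4, so the
   // shuffle source is expressed within the small group only.
   int src_lane_small_warp(int tx) {            // tx in [0, 4)
       return (tx + 1) % 4;
   }

   // Alternative view: the warp stays 32 lanes wide; the source lane is
   // the canonical form B[(wi/4)*4 + ((wi%4)+1)%4] quoted above.
   int src_lane_full_warp(int wi) {             // wi in [0, 32)
       return (wi / 4) * 4 + ((wi % 4) + 1) % 4;
   }

   int main() {
       // Both conventions address the same physical lane once the group
       // base (threadIdx.y * 4 in the comment above) is added back.
       for (int wi = 0; wi < 32; ++wi) {
           int base = (wi / 4) * 4;             // first lane of this group
           int tx = wi % 4;                     // lane within the group
           assert(base + src_lane_small_warp(tx) == src_lane_full_warp(wi));
       }
       printf("index mappings agree for all 32 lanes\n");
       return 0;
   }
   ```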
   
   I think the good thing about the alternative is better compatibility with OpenCL: it lets us keep supporting OpenCL through the old 2-parameter shuffle intrinsic.
   
   And I think the bad thing comes with CUDA's new shuffle API. `__shfl` has actually been [deprecated](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions) by CUDA, so we will have to switch to the new `__shfl_sync` API, which requires an explicit `mask` argument specifying which threads are active during the shuffle. In the approach of this PR, we only need to determine activeness within a partial (say 1/2, 1/4, etc.) warp, which can be derived from the `if` nest given the thread indices of the current partial warp. But in the alternative approach, we would need to determine activeness across the whole warp, which involves threads outside the current partial warp. That may bring considerable complexity, and even runtime overhead when the conditions are dynamic.
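   
   A host-side C++ sketch of the mask arithmetic involved (the helper `partial_warp_mask` is a hypothetical name of mine, not a CUDA or TVM API): under the PR's approach the mask for `__shfl_sync` only spans the lanes of one partial warp, so it is a fixed function of the lane and the group width. Under the alternative, the mask must also encode the activeness of every other group in the 32-lane warp, which is what drags in the outside threads.
   
   ```c++
   #include <cassert>
   #include <cstdint>
   #include <cstdio>
   
   // Mask covering only the partial warp of `width` lanes that contains
   // `lane` -- all the PR's approach needs, assuming the whole group is
   // active (otherwise the if-nest conditions would clear some bits).
   uint32_t partial_warp_mask(int lane, int width) {
       uint32_t group = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
       int base = (lane / width) * width;   // first lane of the group
       return group << base;
   }
   
   int main() {
       // Lane 5 in groups of 4 belongs to lanes 4..7 -> bits 0xF0.
       printf("mask for lane 5, width 4: 0x%x\n", partial_warp_mask(5, 4));
       // In the alternative (full-warp) view, the mask for the same
       // shuffle would have to OR in the activeness of the other seven
       // groups as well, which is only known per-thread at runtime when
       // the surrounding conditions are dynamic.
       return 0;
   }
   ```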
   
   To support both CUDA and OpenCL well, maybe we can use both approaches.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services